
How to integrate util.filter_spans in nlp.pipe() ? - ValueError: [E102] Can't merge non-disjoint spans. #5393

Closed
MoritzLaurer opened this issue May 3, 2020 · 8 comments · Fixed by #5470
Labels
bug Bugs and behaviour differing from documentation feat / pipeline Feature: Processing pipeline and components lang / en English language data and models

Comments

@MoritzLaurer

Hi, I've added nlp.create_pipe("merge_noun_chunks") to my nlp pipeline as described here: https://spacy.io/api/pipeline-functions.
When I run the nlp pipeline on large amounts of text, I sometimes get the following error. For some corpora I get the error, for others I don't, so it probably depends on particular sentences.

ValueError: [E102] Can't merge non-disjoint spans. 'online' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans

I saw in other issues (e.g. #3687), that this can be solved with the util.filter_spans function, but I don't understand how to integrate this helper function in an nlp.pipe pipeline.

Thanks for your advice :)

How to reproduce the behaviour

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))

docs = []
for doc, context in nlp.pipe(context_tpl_lst, as_tuples=True, n_process=1):
    doc._.Date = context["Date"]
    doc._.Category = context["Category"]
    doc._.ID = context["ID"]
    docs.append(doc)

(Unfortunately I can't give you a specific string or context_tpl_lst object, because I don't know which sentence in my corpus causes the error.)
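[Editor's note] For reference, one way to integrate util.filter_spans, as the E102 message suggests, is a custom pipeline component that drops overlapping noun chunks before merging. This is a sketch for the spaCy v2 API, not code from this thread; the component name merge_noun_chunks_safe is my own invention:

```python
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

def merge_noun_chunks_safe(doc):
    # Like the built-in merge_noun_chunks, but keep only the longest
    # non-overlapping chunks before merging, so Doc.retokenize never
    # sees two spans that share a token.
    spans = filter_spans(list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

# filter_spans keeps the longest non-overlapping spans (earliest wins on ties):
doc = Doc(Vocab(), words=["the", "EU", "member", "states", "agree"])
overlapping = [doc[0:2], doc[1:4], doc[2:4]]
kept = filter_spans(overlapping)
print([(s.start, s.end) for s in kept])  # → [(1, 4)]

# Registration, replacing the built-in component (spaCy v2 API):
# nlp = spacy.load("en_core_web_sm")
# nlp.add_pipe(merge_noun_chunks_safe, name="merge_noun_chunks")
```

The demo at the bottom uses a bare Doc so it runs without a trained model; in a real pipeline the spans would come from doc.noun_chunks, which requires a parser.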

Your Environment

  • spaCy version: 2.2.3
  • Platform: Darwin-19.4.0-x86_64-i386-64bit
  • Python version: 3.7.6
@adrianeboyd adrianeboyd added feat / pipeline Feature: Processing pipeline and components usage General spaCy usage labels May 4, 2020
@adrianeboyd
Contributor

Hmm, I actually suspect something is going wrong with noun_chunks, since it shouldn't return chunks that overlap in the first place. Figuring out which sentence causes problems would be a big help in debugging this on our end, so would it be possible (it will be slower, obviously) to use plain nlp with a try/except block to try to figure out which sentence leads to the error? Something like:

# nlp as set up above, with merge_noun_chunks added
for text in texts:
    try:
        doc = nlp(text)
    except ValueError as e:
        print(e)
        print(text)

Also, the output of spacy validate would be helpful to have the exact model version so we can try to reproduce this on our end.

@adrianeboyd adrianeboyd added the more-info-needed This issue needs more information label May 8, 2020
@MoritzLaurer
Author

Hi Adriane, thanks for your response.

I filtered out a few offending sentences and their errors like this:

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))

docs = []
error_lst = []
error_text_lst = []
for text in texts: 
    try: 
        doc = nlp(text)
        docs.append(doc)
    except ValueError as e: 
        error_lst.append(e)
        error_text_lst.append(text)
        print("Error found in sentence: " + text)

This found 12 errors in 73,327 texts.

Here are the texts that caused the errors:

['\n\nDraft conclusions for this Thursday and Friday’s meeting of EU leaders circulated this week among national capitals listing four headline items: “migration,” “security,” “jobs, growth and competitiveness,” and “U.K.” The wording hints at pre-agreement that Europe “needs a balanced and geographically comprehensive approach to migration”; that it should move forward with “strategic reflection” on foreign and security policy; and that the EU needs to promote growth in digital technologies.',
'\n\nTensions between the EU and Ankara have increased since a failed coup in summer 2016 to which Ankara reacted by arresting judges, journalists and opposition figures.',
'\n\nBarnier, the Commission president added, "has an extensive network of contacts in the capitals of all EU member states and in the European Parliament, which I consider a valuable asset for this function."',
'\n\nYet the recent revelation that the Parliament has asked the EU’s anti-fraud investigator OLAF to examine how France’s far-right Front National (FN) had made use of assistants working for the party’s 24 MEPs has again raised questions about how the staffing arrangements are managed.',
'The article also states that it doesn’t undermine commitments under the North Atlantic Treaty Organisation (NATO), which some, though not all, EU member states are also part of.',
'It can only be activated if the emergencies “are of such wide-ranging impact or political significance that they require a co-ordinated EU response on a political level”, the introduction to the 2008 crisis manual says.',
'\n\nCertainly the adventures of Captain Euro and his chums are set to capture the hearts of EU-doubters everywhere and prove for once and for all that the Union is a safe haven from the evils of the world outside.',
'Open Europe shares certain features with Itinera, of which Cleppe lists three: independence, which means not taking any funding from governments or the EU and instead relying on individual donations, primarily from business people; an attitude towards the EU founded on “healthy, constructive criticism”; and the determination not to get trapped in consensus views of critical issues.',
'\n\nDelivery failure\n\nOne message Tocci extracts from her case studies – about which she offers many interesting observations – is that EU involvement has fallen far short of its potential in almost every instance.',
'\n\nThe EU is next week to propose changes to the way that member states can use a EU database holding fingerprints of asylum-seekers.',
'He is ready to make bilateral agreements but engaging with the institutions of the EU is something for which he has lost any appetite.',
'\n\nThat is particularly vital in light of the fact that they will soon embark on another venture about which ordinary EU citizens have grave misgivings.']

Here are the respective errors:

[ValueError("[E102] Can't merge non-disjoint spans. 'Europe' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'Ankara' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'I' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'EU' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'they' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'Cleppe' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'she' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'member' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'he' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'ordinary' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans")]

Regarding spacy validate:
I ran it in the virtual conda environment I'm using and got the output:

====================== Installed models (spaCy v2.2.3) ======================

ℹ spaCy installation:
/Users/moritzlaurer/anaconda3/envs/env_nlp/lib/python3.7/site-packages/spacy

No models found in your current environment.

I suppose it's taking the model from outside the conda env, so I ran spacy validate outside the conda env and got the output:

====================== Installed models (spaCy v2.2.3) ======================
ℹ spaCy installation:
/Users/moritzlaurer/anaconda3/lib/python3.7/site-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.5   ✔
package   en-core-web-md   en_core_web_md   2.2.5   ✔
package   en-core-web-lg   en_core_web_lg   2.2.5   ✔

Hope this helps :)

@no-response no-response bot removed the more-info-needed This issue needs more information label May 9, 2020
@adrianeboyd
Contributor

Thanks for the examples! I can replicate this with v2.2.3 and v2.2.4, but not with master (all with the same model), which I guess is a good sign overall, but I don't know which changes have affected the results, since this code hasn't changed much recently. We will look into it...

@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation lang / en English language data and models and removed usage General spaCy usage labels May 14, 2020
@adrianeboyd
Contributor

An additional text from #5458:

text = "In an era where markets have brought prosperity and empowerment, this leader clings to a bankrupt ideology that has brought Cuba's workers and farmers and families nothing -- nothing -- but isolation and misery."

@MoritzLaurer
Author

Thanks for fixing this :) @honnibal @svlandeg @adrianeboyd

@lock

lock bot commented Jun 24, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020