
How to integrate util.filter_spans in nlp.pipe() ? - ValueError: [E102] Can't merge non-disjoint spans. #5393

Closed
MoritzLaurer opened this issue May 3, 2020 · 8 comments · Fixed by #5470
Labels
bug Bugs and behaviour differing from documentation feat / pipeline Feature: Processing pipeline and components lang / en English language data and models

Comments

@MoritzLaurer

Hi, I've added nlp.create_pipe("merge_noun_chunks") to my nlp pipeline as described here: https://spacy.io/api/pipeline-functions.
When I run the nlp pipeline on large amounts of text, I sometimes get the following error. For some corpora I get the error, for others I don't, so it probably depends on particular sentences.

ValueError: [E102] Can't merge non-disjoint spans. 'online' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:
https://spacy.io/api/top-level#util.filter_spans

I saw in other issues (e.g. #3687), that this can be solved with the util.filter_spans function, but I don't understand how to integrate this helper function in an nlp.pipe pipeline.

Thanks for your advice :)

How to reproduce the behaviour

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))

docs = []
for doc, context in nlp.pipe(context_tpl_lst, as_tuples=True, n_process=1):
    doc._.Date = context["Date"]
    doc._.Category = context["Category"]
    doc._.ID = context["ID"]
    docs.append(doc)

(Unfortunately I can't give you a specific string or context_tpl_lst object, because I don't know which sentence in my corpus causes the error.)
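[Editor's note] For reference, one way to integrate util.filter_spans, as the E102 message suggests, is a custom pipeline component that drops overlapping noun chunks before merging. This is a sketch for the spaCy v2 API, not code from this thread; the component name merge_noun_chunks_safe is my own invention:

```python
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

def merge_noun_chunks_safe(doc):
    # Like the built-in merge_noun_chunks, but keep only the longest
    # non-overlapping chunks before merging, so Doc.retokenize never
    # sees two spans that share a token.
    spans = filter_spans(list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

# filter_spans keeps the longest non-overlapping spans (earliest wins on ties):
doc = Doc(Vocab(), words=["the", "EU", "member", "states", "agree"])
overlapping = [doc[0:2], doc[1:4], doc[2:4]]
kept = filter_spans(overlapping)
print([(s.start, s.end) for s in kept])  # → [(1, 4)]

# Registration, replacing the built-in component (spaCy v2 API):
# nlp = spacy.load("en_core_web_sm")
# nlp.add_pipe(merge_noun_chunks_safe, name="merge_noun_chunks")
```

The demo at the bottom uses a bare Doc so it runs without a trained model; in a real pipeline the spans would come from doc.noun_chunks, which requires a parser.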

Your Environment

  • spaCy version: 2.2.3
  • Platform: Darwin-19.4.0-x86_64-i386-64bit
  • Python version: 3.7.6
@adrianeboyd adrianeboyd added feat / pipeline Feature: Processing pipeline and components usage General spaCy usage labels May 4, 2020
@adrianeboyd
Contributor

Hmm, I actually suspect something is going wrong with noun_chunks, since it shouldn't return chunks that overlap in the first place. Figuring out which sentence causes problems would be a big help in debugging this on our end, so would it be possible (it will be slower, obviously) to use plain nlp with a try/except block to try to figure out which sentence leads to the error? Something like:

# nlp as set up above, with merge_noun_chunks added
for text in texts:
    try:
        doc = nlp(text)
    except ValueError as e:
        print(e)
        print(text)

Also, the output of spacy validate would be helpful to have the exact model version so we can try to reproduce this on our end.

@adrianeboyd adrianeboyd added the more-info-needed This issue needs more information label May 8, 2020
@MoritzLaurer
Author

Hi Adriane, thanks for your response.

I filtered out a few offending sentences and their errors like this:

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))

docs = []
error_lst = []
error_text_lst = []
for text in texts: 
    try: 
        doc = nlp(text)
        docs.append(doc)
    except ValueError as e: 
        error_lst.append(e)
        error_text_lst.append(text)
        print("Error found in sentence: " + text)

This found 12 errors in 73,327 texts.

Here are the texts that caused the errors:

['\n\nDraft conclusions for this Thursday and Friday’s meeting of EU leaders circulated this week among national capitals listing four headline items: “migration,” “security,” “jobs, growth and competitiveness,” and “U.K.” The wording hints at pre-agreement that Europe “needs a balanced and geographically comprehensive approach to migration”; that it should move forward with “strategic reflection” on foreign and security policy; and that the EU needs to promote growth in digital technologies.',
'\n\nTensions between the EU and Ankara have increased since a failed coup in summer 2016 to which Ankara reacted by arresting judges, journalists and opposition figures.',
'\n\nBarnier, the Commission president added, "has an extensive network of contacts in the capitals of all EU member states and in the European Parliament, which I consider a valuable asset for this function."',
'\n\nYet the recent revelation that the Parliament has asked the EU’s anti-fraud investigator OLAF to examine how France’s far-right Front National (FN) had made use of assistants working for the party’s 24 MEPs has again raised questions about how the staffing arrangements are managed.',
'The article also states that it doesn’t undermine commitments under the North Atlantic Treaty Organisation (NATO), which some, though not all, EU member states are also part of.',
'It can only be activated if the emergencies “are of such wide-ranging impact or political significance that they require a co-ordinated EU response on a political level”, the introduction to the 2008 crisis manual says.',
'\n\nCertainly the adventures of Captain Euro and his chums are set to capture the hearts of EU-doubters everywhere and prove for once and for all that the Union is a safe haven from the evils of the world outside.',
'Open Europe shares certain features with Itinera, of which Cleppe lists three: independence, which means not taking any funding from governments or the EU and instead relying on individual donations, primarily from business people; an attitude towards the EU founded on “healthy, constructive criticism”; and the determination not to get trapped in consensus views of critical issues.',
'\n\nDelivery failure\n\nOne message Tocci extracts from her case studies – about which she offers many interesting observations – is that EU involvement has fallen far short of its potential in almost every instance.',
'\n\nThe EU is next week to propose changes to the way that member states can use a EU database holding fingerprints of asylum-seekers.',
'He is ready to make bilateral agreements but engaging with the institutions of the EU is something for which he has lost any appetite.',
'\n\nThat is particularly vital in light of the fact that they will soon embark on another venture about which ordinary EU citizens have grave misgivings.']

Here are the respective errors:

[ValueError("[E102] Can't merge non-disjoint spans. 'Europe' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'Ankara' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'I' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'EU' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'they' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'the' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'Cleppe' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'she' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'member' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'he' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans"),
ValueError("[E102] Can't merge non-disjoint spans. 'ordinary' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper:\nhttps://spacy.io/api/top-level#util.filter_spans")]

Regarding spacy validate:
I ran it in the virtual conda environment I'm using and got the output:

====================== Installed models (spaCy v2.2.3) ======================

ℹ spaCy installation:
/Users/moritzlaurer/anaconda3/envs/env_nlp/lib/python3.7/site-packages/spacy

No models found in your current environment.

I suppose it's taking the model from outside the conda env, so I ran spacy validate outside the conda env and got the output:

====================== Installed models (spaCy v2.2.3) ======================
ℹ spaCy installation:
/Users/moritzlaurer/anaconda3/lib/python3.7/site-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.5   ✔
package   en-core-web-md   en_core_web_md   2.2.5   ✔
package   en-core-web-lg   en_core_web_lg   2.2.5   ✔

Hope this helps :)

@no-response no-response bot removed the more-info-needed This issue needs more information label May 9, 2020
@adrianeboyd
Contributor

Thanks for the examples! I can replicate this with v2.2.3 and v2.2.4, but not with master (all with the same model), which I guess is a good sign overall, but I don't know which changes have affected the results, since this code hasn't changed much recently. We will look into it...

@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation lang / en English language data and models and removed usage General spaCy usage labels May 14, 2020
@adrianeboyd
Contributor

An additional text from #5458:

text = "In an era where markets have brought prosperity and empowerment, this leader clings to a bankrupt ideology that has brought Cuba's workers and farmers and families nothing -- nothing -- but isolation and misery."

@MoritzLaurer
Author

Thanks for fixing this :) @honnibal @svlandeg @adrianeboyd

@lock

lock bot commented Jun 24, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020