Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinction between outside, missing and blocked NER annotations #4307

Merged
merged 20 commits into from
Sep 18, 2019

Conversation

svlandeg
Copy link
Member

@svlandeg svlandeg commented Sep 18, 2019

Description

This PR attempts to process "empty" NER annotations more consistently.

  • Allow the NER algo to overwrite O (ent_iob == 2) annotations
  • Ensure that the NER algo preserves preset entities
  • Allow users to specify tokens that should never be in an entity. This "blocking" is done by setting doc.ents with a Span of tokens with empty ent_type. ent_iob is then set to 3. In the transition system, these are recognized as U- actions, i.e. UNIT actions without a label.
  • As a result of the rewrite, doc.ents = list(doc.ents) now actually keeps the annotations on the token level consistent, instead of resetting O to empty string. It does this by checking previous annotations for each token: if it was nered before, we put it at O, otherwise empty string. This seems to be the most intuitive behaviour for a user inspecting the token-level data.
  • Fixes Different ent_iob behavior after adding EntityRuler to pipeline #4267

Tests

  • I added some new unit tests in test_ner.py and for Issue 4267.
  • I tested the "preserving previous entities" functionality with statistical models, cf here, showing how that works properly. Removed those tests because they rely on the models to be installed.
  • test_doc_add_entities_set_ents_iob was in the repo twice so I removed one, and changed the other to have O annotations.

Open questions

  • Some old tests failed because nn_parser.move_names now contains U-. For now I removed it explicitely from move_names, but we could also adjust the unit tests. Depends on whether or not we want to keep that action internal.

Caveat

For the "blocking" functionality to work with the statistical models, they'll have to be retrained.

Types of change

Enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added enhancement Feature requests and improvements feat / ner Feature: Named Entity Recognizer labels Sep 18, 2019
@svlandeg svlandeg changed the title Distinction between O (outside) and B- (blocked) NER annotations Distinction between outside, missing and blocked NER annotations Sep 18, 2019
@@ -426,7 +426,7 @@ def test_issue957(en_tokenizer):
def test_issue999(train_data):
"""Test that adding entities and resuming training works passably OK.
There are two issues here:
1) We have to readd labels. This isn't very nice.
1) We have to read labels. This isn't very nice.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, sorry --- should've punctuated that better.

Suggested change
1) We have to read labels. This isn't very nice.
1) We have to re-add labels. This isn't very nice.

@honnibal
Copy link
Member

honnibal commented Sep 18, 2019

Great work!

Quick question: do we add U- to the model? If we don't add it up-front, we'll end up having to add it later, which goes through the resize logic. That makes things much more complicated.

I think we should have this as one of the moves that we ensure is present, like we do with the OUT action. The change would be made in the BiluoPushDown.get_actions() method. You do exactly that, sorry!

Have you tried training a model after this change? The example training/train_ner.py script should be sufficient.

@honnibal
Copy link
Member

honnibal commented Sep 18, 2019

I'm going to go ahead and merge this, because it requires retraining the NER models for v2.2, and I want to get that triggered overnight. If it turns out there's a problem we can revert.

@honnibal honnibal merged commit de5a9ec into explosion:master Sep 18, 2019
@svlandeg
Copy link
Member Author

svlandeg commented Sep 18, 2019

No, I haven't tried training a model yet. Once we do, we can use these and these unit tests to test whether the new models work as expected. The first batch of tests was working with the old models, the second batch needs retraining (I wrote the tests before I realised that)

@svlandeg svlandeg deleted the feature/iob-and-u branch September 27, 2019 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / ner Feature: Named Entity Recognizer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Different ent_iob behavior after adding EntityRuler to pipeline
2 participants