Distinction between outside, missing and blocked NER annotations #4307

svlandeg · 2019-09-18T14:47:40Z

Description

This PR attempts to process "empty" NER annotations more consistently.

Allow the NER algo to overwrite O (ent_iob == 2) annotations
Ensure that the NER algo preserves preset entities
Allow users to specify tokens that should never be in an entity. This "blocking" is done by setting doc.ents with a Span of tokens with empty ent_type. ent_iob is then set to 3. In the transition system, these are recognized as U- actions, i.e. UNIT actions without a label.
As a result of the rewrite, doc.ents = list(doc.ents) now keeps the token-level annotations consistent instead of resetting O to an empty string. It does this by checking the previous annotation of each token: if the token was already annotated by the NER, it is set to O; otherwise it stays an empty string. This seems to be the most intuitive behaviour for a user inspecting the token-level data.
Fixes "Different ent_iob behavior after adding EntityRuler to pipeline" (#4267)
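
The three cases described above (missing, outside, blocked) can be pictured with a toy validity check. This is a minimal sketch, not spaCy's transition system: `valid_actions` and the way action names are built are hypothetical, and entity boundaries and parser state are ignored. Only the `ent_iob` codes (0 missing, 1 I, 2 O, 3 B) follow spaCy's documented values.

```python
# Toy illustration only; the real logic lives in spaCy's transition system.
MISSING, INSIDE, OUTSIDE, BEGIN = 0, 1, 2, 3  # documented ent_iob codes


def valid_actions(ent_iob, ent_type, labels):
    """Return the action names a (hypothetical) NER step may take for one token."""
    if ent_iob == BEGIN and ent_type == "":
        # Blocked token: only the label-less UNIT action is allowed.
        return ["U-"]
    if ent_iob in (BEGIN, INSIDE):
        # Preset entity: only actions that keep the existing label are allowed.
        # (Boundary constraints are ignored in this sketch.)
        return [prefix + "-" + ent_type for prefix in ("B", "I", "L", "U")]
    # Missing (0) and outside (2) tokens are both open to any prediction,
    # so an earlier O annotation can be overwritten.
    actions = ["O"]
    for label in labels:
        actions += [prefix + "-" + label for prefix in ("B", "I", "L", "U")]
    return actions


print(valid_actions(BEGIN, "", ["GPE"]))      # blocked token
print(valid_actions(OUTSIDE, "", ["GPE"]))    # O can be overwritten
print(valid_actions(BEGIN, "ORG", ["GPE"]))   # preset entity is preserved
```

The point of the sketch is that O and "missing" are treated the same by the predictor, while a blocked token leaves exactly one option.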

Tests

I added some new unit tests in test_ner.py and for Issue 4267.
I tested the "preserving previous entities" functionality with statistical models (cf. here), which showed that it works properly. I removed those tests because they rely on the models being installed.
test_doc_add_entities_set_ents_iob was in the repo twice so I removed one, and changed the other to have O annotations.
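
The token-level behaviour these tests exercise can be illustrated with a small stand-in model. This is a sketch under stated assumptions, not spaCy's implementation: `Tok` and `set_ents` are hypothetical names, spans are plain `(start, end, label)` tuples with an exclusive end, and the sketch assumes every token of a blocked span receives `ent_iob` 3.

```python
from dataclasses import dataclass

MISSING, INSIDE, OUTSIDE, BEGIN = 0, 1, 2, 3  # documented ent_iob codes


@dataclass
class Tok:
    """Hypothetical stand-in for the NER fields of a spaCy token."""
    text: str
    ent_iob: int = MISSING
    ent_type: str = ""


def set_ents(tokens, spans):
    """Toy version of assigning doc.ents; spans are (start, end, label)."""
    covered = set()
    for start, end, label in spans:
        for i in range(start, end):
            if label == "":
                # Blocked token: ent_iob 3 with an empty ent_type (assumption:
                # this applies to every token in the span).
                tokens[i].ent_iob = BEGIN
            else:
                tokens[i].ent_iob = BEGIN if i == start else INSIDE
            tokens[i].ent_type = label
            covered.add(i)
    for i, tok in enumerate(tokens):
        if i not in covered:
            # Tokens that carried an annotation before fall back to O;
            # tokens that were never annotated stay missing.
            if tok.ent_iob != MISSING:
                tok.ent_iob = OUTSIDE
            tok.ent_type = ""


toks = [Tok(t) for t in "I live in New York".split()]
set_ents(toks, [(3, 5, "GPE")])
print([t.ent_iob for t in toks])  # [0, 0, 0, 3, 1]
set_ents(toks, [])
print([t.ent_iob for t in toks])  # [0, 0, 0, 2, 2]
```

In the second call the two tokens that used to be an entity become O, while the three tokens that never had an annotation remain missing.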

Open questions

Some old tests failed because nn_parser.move_names now contains U-. For now I removed it explicitly from move_names, but we could also adjust the unit tests instead. It depends on whether we want to keep that action internal.
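
The trade-off can be sketched as follows. This is an illustration only: `build_actions` and `public_move_names` are hypothetical helpers, not spaCy functions, and the ordering of the action names is an assumption. The idea is that the label-less U- action is always part of the internal action table, while the user-facing list hides it.

```python
def build_actions(labels):
    """Hypothetical action table: O and the label-less U- are always present."""
    actions = ["O", "U-"]
    for label in sorted(labels):
        actions += [f"{prefix}-{label}" for prefix in ("B", "I", "L", "U")]
    return actions


def public_move_names(actions):
    """Hide the internal blocking action from the user-facing list."""
    return [name for name in actions if name != "U-"]


actions = build_actions({"GPE"})
print(actions)
print(public_move_names(actions))
```

Keeping U- out of the public list means existing tests that compare move names stay unchanged; exposing it would require updating those tests.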

Caveat

For the "blocking" functionality to work with the statistical models, they'll have to be retrained.

Types of change

Enhancement

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

honnibal · 2019-09-18T19:16:48Z

spacy/tests/regression/test_issue1-1000.py

@@ -426,7 +426,7 @@ def test_issue957(en_tokenizer):
 def test_issue999(train_data):
 """Test that adding entities and resuming training works passably OK.
 There are two issues here:
- 1) We have to readd labels. This isn't very nice.
+ 1) We have to read labels. This isn't very nice.


Heh, sorry --- should've punctuated that better.

Suggested change

- 1) We have to read labels. This isn't very nice.
+ 1) We have to re-add labels. This isn't very nice.

honnibal · 2019-09-18T19:19:31Z

Great work!

Quick question: do we add U- to the model? If we don't add it up-front, we'll end up having to add it later, which goes through the resize logic. That makes things much more complicated.

~~I think we should have this as one of the moves that we ensure is present, like we do with the OUT action. The change would be made in the BiluoPushDown.get_actions() method.~~ You do exactly that, sorry!

Have you tried training a model after this change? The example training/train_ner.py script should be sufficient.

honnibal · 2019-09-18T19:36:56Z

I'm going to go ahead and merge this, because it requires retraining the NER models for v2.2, and I want to get that triggered overnight. If it turns out there's a problem we can revert.

svlandeg · 2019-09-18T20:22:23Z

No, I haven't tried training a model yet. Once we do, we can use these and these unit tests to check whether the new models work as expected. The first batch of tests worked with the old models; the second batch needs retrained models (I wrote the tests before I realised that).

svlandeg added 20 commits September 16, 2019 10:14

remove duplicate unit test

dff3592

unit test (currently failing) for issue 4267

555b149

bugfix: ensure doc.ents preserves kb_id annotations

6239e33

fix in setting doc.ents with empty label

b9d409f

rename

cff4c84

test for presetting an entity to a certain type

d361f3c

allow overwriting Outside + blocking presets

13d0887

fix actions when previous label needs to be kept

d42555f

fix default ent_iob in set entities

fe9424f

cleaner solution with U- action

8c4266a

remove debugging print statements

5086709

unit tests with explicit transitions and is_valid testing

5ca471c

remove U- from move_names explicitly

c613263

remove unit tests with pre-trained models that don't work

08b06e4

remove (working) unit tests with pre-trained models

fc779b3

clean up unit tests

8b21b8a

move unit tests

4e1d1f9

Merge remote-tracking branch 'upstream/master' into feature/iob-and-u

ce792e0

small fixes

2e764d8

remove two TODO's from doc.ents comments

f2b6009

svlandeg added the enhancement (Feature requests and improvements) and feat / ner (Feature: Named Entity Recognizer) labels Sep 18, 2019

svlandeg changed the title ~~Distinction between O (outside) and B- (blocked) NER annotations~~ Distinction between outside, missing and blocked NER annotations Sep 18, 2019

honnibal reviewed Sep 18, 2019

honnibal merged commit de5a9ec into explosion:master Sep 18, 2019

svlandeg deleted the feature/iob-and-u branch September 27, 2019 11:58