Accessing custom Token's extension via Matcher #1499

damianoporta · 2017-11-06T16:37:09Z

Your Environment

Operating System: 16.04
Python Version Used: 3.5.2
spaCy Version Used: 2.0

Hello,
It does not seem possible to access a token's extension via the Matcher. Example:

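import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load("it")
doc = nlp("Test 10 abc")

# custom extension whose getter returns 2 for every token
my_test = lambda token: 2
Token.set_extension('mytest', getter=my_test)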

def add_event_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start: end]
    print(span)

matcher = Matcher(nlp.vocab)

pattern = [{'SHAPE': 'dd'}, {'MYTEST': 2}]
matcher.add('test', add_event_ent, pattern)

matches = matcher(doc)

I do not get errors, but I see no matches.

Can I not use custom extensions via the Matcher?


ines · 2017-11-06T18:16:25Z

Ah - maybe this needs to be more clear in the docs. Token attributes and flags are two different things. Even though most built-in attributes translate to flags and token match attributes (e.g. is_stop → IS_STOP, pos_ → POS), the matcher can't take advantage of the custom attributes, because it can only access the C-level data. Otherwise, it wouldn't be efficient enough.

So you have two options:

1. Work with flags instead of extension attributes

If you only need the lexeme (i.e. the lexical entry without its contextual attributes) and you can break your custom attribute down into a binary flag, you can use vocab.add_flag to add a flag with a getter that takes the token text and returns True or False.

# get ID for custom flag and add getter (in this case, it checks whether the token text is in a list)
IS_TEST = nlp.vocab.add_flag(lambda text: text in ['test', 'testing'])  # needs to be binary!
pattern = [{'SHAPE': 'dd'}, {IS_TEST: True}]

This is similar to the lexical attributes in the language data.
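To illustrate what "breaking a custom attribute down into a binary flag" can look like, here is a minimal sketch in plain Python. The function names (token_length, is_long, is_test_word) are hypothetical and only serve as examples. The getters receive the token text and return True or False, which is the shape vocab.add_flag expects according to the description above.

```python
def token_length(text):
    """A multi-valued attribute: returns an int, so it cannot be a flag as-is."""
    return len(text)


def is_long(text):
    """Binary version of the same idea: True or False only."""
    return token_length(text) > 5


def is_test_word(text):
    """Case-insensitive variant of the getter shown above (an assumption,
    the original getter compares the text exactly)."""
    return text.lower() in ("test", "testing")


# In the setup from this thread, such a getter would be registered with
# nlp.vocab.add_flag(is_long), and the returned ID used as a pattern key.
print(is_long("testing"), is_long("abc"), is_test_word("Test"))
```

Because the getter only sees the text, anything that depends on context (for example the neighbouring tokens) cannot be expressed this way and needs the second option.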

2. Match first, then check the extension attribute

This is the more flexible solution. Assume you want one token of shape dd followed by another token that has your custom attribute set. You can first match dd tokens plus their following token and, when you get a match, check whether the second token has your custom attribute set. Since you have access to the whole token here, you can also access the ._ extensions:

pattern = [{'SHAPE': 'dd'}, {}]   #  empty dict for "any token", or specify IS_ALPHA etc.
matcher.add('test', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start : end]
    # all your matches are two tokens, so you can refer to span[1]
    if span[1]._.mytest == 2:
        print(span)
        # do something with your span here

You can also add an on_match callback as the second argument of matcher.add.
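As a rough sketch of the callback variant, the same check can live in an on_match function with the (matcher, doc, i, matches) signature used earlier in this thread. The name keep_if_mytest and the accepted list are hypothetical. The fake_token helper builds stand-in objects (not spaCy classes) so that the snippet runs without spaCy installed; with a real Doc, only the callback itself would be needed.

```python
from types import SimpleNamespace

accepted = []  # spans that pass the extension check


def keep_if_mytest(matcher, doc, i, matches):
    """on_match callback: keep a match only if its second token has ._.mytest == 2."""
    match_id, start, end = matches[i]
    span = doc[start:end]
    # The pattern matches exactly two tokens, so span[1] is the second one.
    if span[1]._.mytest == 2:
        accepted.append(span)


def fake_token(text, mytest):
    """Stand-in for a token with a ._.mytest extension (NOT a spaCy object)."""
    return SimpleNamespace(text=text, _=SimpleNamespace(mytest=mytest))


# Stand-in document and matches, shaped like (match_id, start, end) tuples.
doc = [
    fake_token("Test", 0),
    fake_token("10", 0),
    fake_token("abc", 2),
    fake_token("20", 0),
    fake_token("xyz", 1),
]
matches = [(0, 1, 3), (0, 3, 5)]

for i in range(len(matches)):
    keep_if_mytest(None, doc, i, matches)

print([[t.text for t in span] for span in accepted])  # [['10', 'abc']]
```

With the spaCy 2.0 signature used in this thread, the callback would be registered as matcher.add('test', keep_if_mytest, pattern), and the matcher would call it once per match.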

damianoporta · 2017-11-06T19:14:22Z

Ok, I will follow the second approach. Thank you

damianoporta · 2017-11-06T22:13:31Z

@ines Sorry, one more question. Are custom attributes used during NER? Can I add custom features to improve accuracy?

ines · 2017-11-08T22:08:05Z

@damianoporta No – spaCy can't know what custom attributes you've added and what they mean. And even if it did, you could only achieve accuracy improvements if you add the custom attributes as features during training, and then make them available when you run your custom model.

If you want to improve the NER accuracy, the best strategy is to extract training examples (e.g. using the matcher), and then update the model.

lock · 2018-05-08T10:28:01Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added usage General spaCy usage 🌙 nightly Discussion and contributions related to nightly builds labels Nov 6, 2017

damianoporta closed this as completed Nov 6, 2017

adam-ra mentioned this issue Jan 11, 2018

Token stems available for Matcher via vocab or Token custom attributes #1825

Closed

ines mentioned this issue Feb 12, 2018

💫 Better, faster and more customisable matcher #1971

Closed

5 tasks

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
