Accessing custom Token's extension via Matcher #1499

Closed
damianoporta opened this issue Nov 6, 2017 · 5 comments
Labels
🌙 nightly (Discussion and contributions related to nightly builds), usage (General spaCy usage)

Comments

@damianoporta

Your Environment

  • Operating System: 16.04
  • Python Version Used: 3.5.2
  • spaCy Version Used: 2.0

Hello,
it seems it is not possible to access a token's extension via the Matcher. Example:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load("it")
doc = nlp("Test 10 abc")

my_test = lambda token: 2
Token.set_extension('mytest', getter=my_test)

def add_event_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start: end]
    print(span)

matcher = Matcher(nlp.vocab)

pattern = [{'SHAPE': 'dd'}, {'MYTEST': 2}]
matcher.add('test', add_event_ent, pattern)

matches = matcher(doc)

I do not get errors, but I see no matches.

Can I not use custom extensions via the Matcher?

@ines
Member

ines commented Nov 6, 2017

Ah - maybe this needs to be more clear in the docs. Token attributes and flags are two different things. Even though most built-in attributes translate to flags and token match attributes (e.g. is_stop → IS_STOP, pos_ → POS), the matcher can't take advantage of the custom attributes, because it can only access the C-level data. Otherwise, it wouldn't be efficient enough.

So you have two options:

1. Work with flags instead of extension attributes

If you only need the lexeme (i.e. the lexical entry without its contextual attributes) and you can break your custom attribute down into a binary flag, you can use vocab.add_flag to add a flag with a getter that takes the token text and returns True or False.

# get ID for custom flag and add getter (in this case, it returns whether the token text is in the list)
IS_TEST = nlp.vocab.add_flag(lambda text: text in ['test', 'testing'])  # needs to be binary!
pattern = [{'SHAPE': 'dd'}, {IS_TEST: True}]

This is similar to the lexical attributes in the language data.

2. Match first, then check the extension attribute

This is the more flexible solution. Assume you want one token of shape dd followed by another one that has your custom attribute set. You can first match dd tokens plus their following token and, when you get a match, check whether the second token has your custom attribute set. Since you have access to the whole token here, you can also access the ._ extensions:

pattern = [{'SHAPE': 'dd'}, {}]   #  empty dict for "any token", or specify IS_ALPHA etc.
matcher.add('test', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start : end]
    # all your matches are two tokens, so you can refer to span[1]
    if span[1]._.mytest == 2:  # extension was registered as 'mytest'
        print(span)
        # do something with your span here
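
A self-contained sketch of the match-then-filter approach above, in case it helps to see it end to end. Assumptions not in the original: it uses spacy.blank("en") so no model download is needed, a hypothetical text_len extension in place of the custom attribute, and the newer matcher.add(name, [pattern]) call signature (on spaCy 2.0, as in this thread, the call would be matcher.add('test', None, pattern)).

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# Assumption: a blank English pipeline (tokenizer only) is enough here.
nlp = spacy.blank("en")

# Hypothetical extension for illustration: length of the token text.
# force=True allows re-running the snippet without a "already set" error.
Token.set_extension("text_len", getter=lambda token: len(token.text), force=True)

matcher = Matcher(nlp.vocab)
# A two-digit token followed by any token; the extension is NOT in the pattern.
pattern = [{"SHAPE": "dd"}, {}]
matcher.add("test", [pattern])  # assumption: newer list-of-patterns signature

doc = nlp("Test 10 abc 42 hello")

filtered = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # Check the extension on the second token after the matcher has run.
    if span[1]._.text_len == 3:
        filtered.append(span.text)

print(filtered)
```

The matcher only narrows the candidates using attributes it can see; the extension check happens in plain Python afterwards, so any getter logic works.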

You can also add an on_match callback as the second argument of matcher.add.
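
A rough sketch of the callback variant, reusing the mytest extension from the question. Assumptions not in the original: a blank English pipeline, a module-level kept list to collect results, a hypothetical callback name keep_if_mytest, and the newer keyword form matcher.add(name, [pattern], on_match=...) (on spaCy 2.0 the callback is passed as the second positional argument, as described above).

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
Token.set_extension("mytest", getter=lambda token: 2, force=True)

kept = []  # hypothetical container for spans that pass the extension check

def keep_if_mytest(matcher, doc, i, matches):
    # Called once per match; i indexes into the full list of matches.
    match_id, start, end = matches[i]
    span = doc[start:end]
    if span[1]._.mytest == 2:
        kept.append(span.text)

matcher = Matcher(nlp.vocab)
matcher.add("test", [[{"SHAPE": "dd"}, {}]], on_match=keep_if_mytest)

matcher(nlp("Test 10 abc"))
print(kept)
```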

@ines added the usage (General spaCy usage) and 🌙 nightly (Discussion and contributions related to nightly builds) labels Nov 6, 2017
@damianoporta
Author

Ok, I will follow the second approach. Thank you

@damianoporta
Author

@ines pardon, one more question. Are custom attributes used during NER? Can I add custom features to improve accuracy?

@ines
Member

ines commented Nov 8, 2017

@damianoporta No – spaCy can't know what custom attributes you've added and what they mean. And even if it did, you could only achieve accuracy improvements if you add the custom attributes as features during training, and then make them available when you run your custom model.

If you want to improve the NER accuracy, the best strategy is to extract training examples (e.g. using the matcher), and then update the model.

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018

2 participants