-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correction of default lemmatizer lookup in English (Issue # 4104) #4110
Conversation
Hi @ajrader, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agrement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the If you've already included the Contributor Agreement in your pull request above, you can ignore this message. |
"""Test that English lookup lemmatization of spun & dry are correct""" | ||
doc = get_doc(en_vocab, [t for t in text.split(" ")]) | ||
expected = {'dry': 'dry', 'spun': 'spin', 'spun-dry': 'spin-dry'} | ||
assert [token.lemma_ for token in doc] == list(expected.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test! Looks like this one failed on Python 3.5 because dicts aren't ordered yet, so values()
returns the key in a different order. (Totally not your fault btw, it's not exactly intuitive.) Calling sorted
around both lists should resolve this and make sure the order is always the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I went ahead and streamlined the test to not parametrize the string and compare it directly to a list of expected results.
from ..util import get_doc | ||
|
||
@pytest.mark.parametrize('text', ['dry spun spun-dry']) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import pytest | ||
from ..util import get_doc | ||
|
||
@pytest.mark.parametrize('text', ['dry spun spun-dry']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we need to parametrize here because the expected values are hard-coded into the test anyways. So there's not really a motivation to try out different words here. So feel free to move the string 'dry spun spun-dry'
into the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for creating the PR !
…plosion#4110) * pytest file for issue4104 established * edited default lookup english lemmatizer for spun; fixes issue 4102 * eliminated parameterization and sorted dictionary dependnency in issue 4104 test * added contributor agreement
Resolves issue 4102.
Description
Types of change
mainly feat / lemmatizer bug fix.
Note that this change will become obsolete once using a default lookup for a language is implemented.
Checklist