
Correction of default lemmatizer lookup in English (Issue #4104) #4110

Merged (6 commits, Aug 15, 2019)

Conversation

ajrader
Contributor

@ajrader ajrader commented Aug 12, 2019

Resolves issue 4102.

Description

  • made the following changes to lookup.py:
    • 'dry' : 'dry'
    • 'spun': 'spin'
    • 'spun-dry': 'spin-dry'
  • created new test (test_issue4104.py) and verified it passed after the above changes.
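The fix above amounts to correcting entries in a word-to-lemma table. As a minimal illustrative sketch (a plain dict named `LOOKUP` standing in for spaCy's English lookup table; `lookup_lemma` is a hypothetical helper, not spaCy's API):

```python
# Illustrative sketch of dict-based lookup lemmatization.
# The corrected entries from this PR:
LOOKUP = {
    "dry": "dry",          # previously mapped to a wrong lemma
    "spun": "spin",
    "spun-dry": "spin-dry",
}

def lookup_lemma(word):
    # Fall back to the surface form when the word is not in the table.
    return LOOKUP.get(word, word)
```

With these entries, `lookup_lemma("spun")` yields `"spin"`, and unknown words pass through unchanged.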

Types of change

Mainly a feat / lemmatizer bug fix.

Note that this change will become obsolete once using a default lookup for a language is implemented.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines ines added feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / en English language data and models perf / accuracy Performance: accuracy labels Aug 12, 2019
@explosion-bot
Collaborator

Hi @ajrader, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agreement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the .github/contributors directory and add it to this pull request. If your pull request targets a branch that's not master, for example develop, make sure to submit the Contributor Agreement to the master branch. Thanks a lot!

If you've already included the Contributor Agreement in your pull request above, you can ignore this message.

"""Test that English lookup lemmatization of spun & dry are correct"""
doc = get_doc(en_vocab, [t for t in text.split(" ")])
expected = {'dry': 'dry', 'spun': 'spin', 'spun-dry': 'spin-dry'}
assert [token.lemma_ for token in doc] == list(expected.values())
Member

Thanks for adding the test! Looks like this one failed on Python 3.5 because dicts aren't ordered there, so values() returns the values in a different order. (Totally not your fault, btw; it's not exactly intuitive.) Calling sorted on both lists should resolve this and make sure the order is always the same.

Contributor Author

Thanks. I went ahead and streamlined the test to not parametrize the string and compare it directly to a list of expected results.

from ..util import get_doc

@pytest.mark.parametrize('text', ['dry spun spun-dry'])

Member

Suggested change

import pytest
from ..util import get_doc

@pytest.mark.parametrize('text', ['dry spun spun-dry'])
Member

Not sure we need to parametrize here, because the expected values are hard-coded into the test anyway, so there's no real motivation to try out different words. Feel free to move the string 'dry spun spun-dry' into the function.
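The simplified, de-parametrized test might look like the sketch below. To keep it self-contained here, a stub `lookup_lemma` over the corrected table stands in for spaCy's real pipeline; in the actual test, the lemmas would come from `token.lemma_` on a `Doc` built with the `get_doc` helper and the `en_vocab` fixture:

```python
# Stand-in for spaCy's English lookup lemmatizer (not the real API).
LOOKUP = {"dry": "dry", "spun": "spin", "spun-dry": "spin-dry"}

def lookup_lemma(word):
    return LOOKUP.get(word, word)

def test_issue4104():
    """Test that English lookup lemmatization of spun & dry is correct."""
    # The input string lives inside the test, as suggested; no parametrization.
    text = "dry spun spun-dry"
    lemmas = [lookup_lemma(t) for t in text.split(" ")]
    # Compare directly against a hard-coded list, avoiding dict ordering issues.
    assert lemmas == ["dry", "spin", "spin-dry"]

test_issue4104()
```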

Member

@svlandeg svlandeg left a comment

Awesome, thanks for creating the PR!

@ines ines merged commit 2f36487 into explosion:master Aug 15, 2019
polm pushed a commit to polm/spaCy that referenced this pull request Aug 18, 2019
…plosion#4110)

* pytest file for issue4104 established

* edited default lookup english lemmatizer for spun; fixes issue 4102

* eliminated parametrization and sorted-dictionary dependency in issue 4104 test

* added contributor agreement
Development

Successfully merging this pull request may close these issues.

Wrong lemmatization using only tokenization (en language)
4 participants