Correction of default lemmatizer lookup in English (Issue #4104) #4110

Merged: 6 commits, merged Aug 15, 2019

Changes from 3 commits
5 changes: 3 additions & 2 deletions spacy/lang/en/lemmatizer/lookup.py

@@ -11558,7 +11558,7 @@
     "drunker": "drunk",
     "drunkest": "drunk",
     "drunks": "drunk",
-    "dry": "spin-dry",
+    "dry": "dry",
     "dry-cleaned": "dry-clean",
     "dry-cleaners": "dry-cleaner",
     "dry-cleaning": "dry-clean",
@@ -35294,7 +35294,8 @@
     "spryer": "spry",
     "spryest": "spry",
     "spuds": "spud",
-    "spun": "spin-dry",
+    "spun": "spin",
+    "spun-dry": "spin-dry",
     "spunkier": "spunky",
     "spunkiest": "spunky",
     "spunks": "spunk",
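The corrected entries above illustrate how spaCy's lookup lemmatizer works: the lemma is a plain dictionary lookup keyed on the surface form, so a single wrong entry miscorrects every occurrence of that word. A minimal self-contained sketch (the `LOOKUP` table and `lemmatize` helper here are illustrative, not spaCy's actual API):

```python
# Tiny excerpt mirroring the corrected entries from lookup.py.
LOOKUP = {
    "dry": "dry",            # was "spin-dry" before this fix
    "spun": "spin",          # was "spin-dry" before this fix
    "spun-dry": "spin-dry",  # newly added entry
}


def lemmatize(word):
    # Lookup lemmatization: fall back to the surface form when the
    # word has no entry in the table.
    return LOOKUP.get(word, word)


print([lemmatize(w) for w in ["dry", "spun", "spun-dry"]])
# → ['dry', 'spin', 'spin-dry']
```

Before this fix, the same lookup would have returned "spin-dry" for both "dry" and "spun".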
13 changes: 13 additions & 0 deletions spacy/tests/regression/test_issue4104.py

@@ -0,0 +1,13 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import pytest
+from ..util import get_doc
+
+@pytest.mark.parametrize('text', ['dry spun spun-dry'])
Member:

Not sure we need to parametrize here, because the expected values are hard-coded into the test anyway, so there's no real motivation to try out different words. Feel free to move the string 'dry spun spun-dry' into the function.

+def test_issue4104(en_tokenizer, en_vocab, text):
+    """Test that English lookup lemmatization of spun & dry are correct"""
+    doc = get_doc(en_vocab, [t for t in text.split(" ")])
+    expected = {'dry': 'dry', 'spun': 'spin', 'spun-dry': 'spin-dry'}
+    assert [token.lemma_ for token in doc] == list(expected.values())
Member:

Thanks for adding the test! Looks like this one failed on Python 3.5 because dicts aren't ordered yet, so values() returns the values in a different order. (Totally not your fault btw, it's not exactly intuitive.) Calling sorted() around both lists should resolve this and make sure the order is always the same.
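The ordering issue the reviewer describes can be shown in isolation. Before Python 3.7, dict iteration order is not guaranteed, so comparing token lemmas positionally against `expected.values()` can fail even when the lemmas themselves are correct; sorting both sides removes the ordering assumption. A small sketch (the `lemmas` list stands in for the lemmas the test's `doc` would produce):

```python
# What the doc would produce (order follows the input text).
lemmas = ["dry", "spin", "spin-dry"]

# Expected lemmas keyed by surface form; on Python < 3.7 the iteration
# order of values() is not guaranteed to match insertion order.
expected = {"dry": "dry", "spun": "spin", "spun-dry": "spin-dry"}

# Order-insensitive comparison, as the reviewer suggests.
assert sorted(lemmas) == sorted(expected.values())
```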

Contributor Author:

Thanks. I went ahead and streamlined the test to not parametrize the string and to compare it directly against a list of expected results.
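The streamlined shape the author describes might look like the sketch below. It is self-contained for illustration: a stub `LOOKUP` table stands in for spaCy's `en_vocab`/`get_doc` test fixtures, and the merged test may differ in detail. The point is the structure: no parametrization, and a direct comparison against a list of expected lemmas, which also sidesteps the dict-ordering problem.

```python
# Stub lookup table standing in for the English lemmatizer fixtures.
LOOKUP = {"dry": "dry", "spun": "spin", "spun-dry": "spin-dry"}


def test_issue4104():
    """English lookup lemmatization of dry, spun, and spun-dry is correct."""
    text = "dry spun spun-dry"
    lemmas = [LOOKUP.get(t, t) for t in text.split(" ")]
    # Direct list comparison: order is fixed by the input text.
    assert lemmas == ["dry", "spin", "spin-dry"]


test_issue4104()
```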