Overzealous lemmatisation of -ss nouns #903
Comments
Damn. I know exactly what must have caused this :( There's a base-form check in the lemmatizer --- if a word is listed as a base form, it shouldn't be lemmatized. Obviously this check is broken for nouns.
Here: https:/explosion/spaCy/blob/master/spacy/lemmatizer.py#L52 This looks up the enum symbols for the verb forms, but misses the enum symbols for the nouns. We just need to work out which morphological features indicate that a noun is a base form, and list them here. We also need a regression test. I've got meetings today and most of tomorrow, so I'm hoping someone else can get the fix up? 🙇‍♂️
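For anyone picking this up, here is a rough sketch of the kind of base-form check described above. It is not the code in `lemmatizer.py`: the function names, the string-valued feature keys (`"Number"`, `"VerbForm"`) and the one-rule suffix stripper are all assumptions made for illustration.

```python
def is_base_form(univ_pos, morphology=None):
    """Return True if the features say the word is already its own lemma.

    Hypothetical helper: the feature names and values are assumptions.
    """
    morphology = morphology or {}
    if univ_pos == "verb":
        # An infinitive is already the base form of a verb.
        return morphology.get("VerbForm") == "inf"
    if univ_pos == "noun":
        # The case the bug report is about: a singular noun is a base form.
        return morphology.get("Number") == "sing"
    return False


def strip_plural_s(word):
    """Deliberately naive stand-in for the real suffix rules."""
    return word[:-1] if word.endswith("s") else word


def lemmatize(word, univ_pos, morphology=None):
    """Lemmatize a word, skipping the suffix rules for base forms."""
    word = word.lower()
    if is_base_form(univ_pos, morphology):
        return word
    if univ_pos == "noun":
        return strip_plural_s(word)
    return word


# Singular noun: the check short-circuits and the final -s survives.
print(lemmatize("loss", "noun", {"Number": "sing"}))
# Plural noun: the suffix rule still applies.
print(lemmatize("cats", "noun", {"Number": "plur"}))
# No features passed: the check cannot fire and the -s is stripped.
print(lemmatize("loss", "noun"))
```

The last call is the point of the sketch: if the caller does not pass the features in the form the check expects, the check silently does nothing and the suffix rules run anyway.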
The morphology class was calling the lemmatizer inconsistently, with some string-valued attributes. This caused Issue #903.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
The final -s is stripped even though the tag assigned is a singular noun (NN).
Some examples: sleepiness, incompleteness, loss, ass (unless they are recognised as proper nouns, which happens often if they are sentence-first).
A similar thing happens to nouns with other suffixes, e.g. anus → anu.
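To make the symptom concrete, here is a toy reproduction of how ordered suffix rules can produce these lemmas when nothing stops them from running on a singular noun. The rule list and the tiny lemma index are invented for illustration, and the fallback to out-of-vocabulary candidates is an assumption about the general technique, not a copy of spaCy's implementation.

```python
# Hypothetical noun suffix rules: (suffix to remove, replacement).
NOUN_RULES = [("ses", "s"), ("ies", "y"), ("s", "")]

# Hypothetical index of known lemmas; a real one would be far larger.
KNOWN_LEMMAS = {"cat", "pony"}


def naive_noun_lemmas(word):
    """Apply every matching suffix rule, preferring candidates in the index."""
    word = word.lower()
    known, unknown = [], []
    for old, new in NOUN_RULES:
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            if candidate in KNOWN_LEMMAS:
                known.append(candidate)
            else:
                unknown.append(candidate)
    # Fall back to unknown candidates, then to the word itself.
    return known or unknown or [word]


for noun in ["cats", "ponies", "loss", "sleepiness", "anus"]:
    print(noun, "->", naive_noun_lemmas(noun))
```

In this sketch the plurals come out right, but `loss`, `sleepiness` and `anus` all lose their final -s because the bare `s` rule matches and no candidate is found in the index, which matches the pattern reported here.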
Seen in spaCy 1.7.2, model en_depent_web_md-1.2.1.