Incorrect lemmatization of some words #3665
Comments
Thanks for the work on this. You should probably hold off on making the corrections until at least v2.1.4 is released, as I did fix a bug in the lemmatizer that might be affecting some of these results. The other thing happening with the lemmatizer is that we're switching over to support richer morphological features, which will let us write much better rules. This should improve accuracy significantly.
If you're going to make lemmatizer changes, you might consider looking at LemmInflect (my code). It has a small character-based neural-net classifier that looks at the word and selects 1 of 34 automatically generated lemmatization rules. Since it's a NN, it works great on OOV words as well as dictionary ones. The net is small enough that it takes hardly any CPU time, and it's implemented with numpy, so no additional 3rd-party libs are needed. The module also has code for parsing and using the NIH's SPECIALIST Lexicon, which is a great resource for morphology info on English words. It has about 500K words in it and appears to be very accurate. Check it out if you need corpus resources for this. I'd be happy to contribute if you get to a point where you're interested.
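To make the "selects 1 of N lemmatization rules" idea concrete, here is a toy sketch in plain Python. This is NOT LemmInflect's actual implementation: the rules and the longest-suffix-match selection below are made up for illustration; the real library selects among its automatically generated rules with a trained character-based classifier.

```python
# Conceptual sketch: each "rule" strips one suffix and appends another,
# and a selector picks which rule applies to a given word. A real
# system (like the one described above) would choose the rule with a
# trained classifier rather than by first-match on the suffix.

RULES = [
    ("ies", "y"),    # "studies" -> "study"
    ("ing", "e"),    # "hating"  -> "hate"
    ("es",  ""),     # "watches" -> "watch"
    ("s",   ""),     # "dogs"    -> "dog"
]

def lemmatize(word):
    """Apply the first matching suffix rule; fall back to identity."""
    for strip, add in RULES:
        if word.endswith(strip):
            return word[: -len(strip)] + add
    return word

print(lemmatize("studies"))  # -> study
print(lemmatize("hating"))   # -> hate
```

The point of learning *which* rule to apply per word (rather than hand-ordering a fixed list like this) is that the classifier generalizes to out-of-vocabulary words.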
Really nice module, thanks!
Merging this with the master issue in #2668!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Issue
For some words, spaCy doesn't produce the correct lemma. Using an automated method I found about 400 incorrect lemma forms. See mismatches.txt. This is a list of potential issues that would need to be reviewed by hand before inclusion in the exceptions lists.
Test Technique
I had spaCy parse (using en_core_web_sm) through the Gutenberg corpus (what's in NLTK) and then tested the produced lemma against a lookup table I created from the NIH SPECIALIST Lexicon. The table maps words to a list of potential base forms (it's basically a dictionary-based lemmatizer instead of a rule-based one). I didn't look at proper nouns or other forms that spaCy doesn't inflect (except adverbs, where there were 28 unhandled by spaCy).
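The comparison pass described above can be sketched as follows. The lookup table and the (word, lemma) pairs here are toy stand-ins for the SPECIALIST-derived table and spaCy's actual output, just to show the shape of the check:

```python
# Toy sketch of the validation pass: compare a lemmatizer's output
# against a dictionary of acceptable base forms. BASE_FORMS stands in
# for the table built from the SPECIALIST Lexicon; parsed_lemmas stands
# in for (word, lemma) pairs produced by parsing the corpus with spaCy.

BASE_FORMS = {
    "hating":  {"hate"},
    "watches": {"watch"},
    "geese":   {"goose"},
}

parsed_lemmas = [
    ("hating", "hat"),     # wrong: a rule over-stripped the suffix
    ("watches", "watch"),  # correct
    ("geese", "goose"),    # correct
]

# A produced lemma is flagged only when the word is in the table and
# the lemma is not among its acceptable base forms.
mismatches = [
    (word, lemma)
    for word, lemma in parsed_lemmas
    if word in BASE_FORMS and lemma not in BASE_FORMS[word]
]
print(mismatches)  # -> [('hating', 'hat')]
```

Because the table maps each word to a *set* of potential base forms, a lemma is only flagged when it matches none of them, which keeps false positives down.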
The test could easily be run with another model (which might give slightly different tagging in some cases) or a different corpus (which could include more words or possibly more "modern" English). If you have opinions on either, let me know.
Environment
Note that I ran this against code that included PR #3646 ("Fix inconsistant lemmatizer issue"), which addresses issue #3484 ("Inconsistent lemmatization across different python sessions"). The results would be (inconsistently) different otherwise (e.g. "hating" sometimes lemmatizes to "hat", "dose" to "dos", etc.).
Proposed Fix
It's a bit of work to hand-edit the above list and add it to the code, so I wanted to check with the experts and get approval / opinions before going through with it. In addition, I'd like to verify that the inconsistencies PR referenced above will be included in the next release, and that you aren't planning any big changes to the lemmatizer. Those two things could impact the test results and the list of changes.
Here's what I propose.
Alternately, instead of patching the holes we could consider upgrading to a more extensive set of rules and/or moving to a dictionary-based approach. However, either of these would require a fair bit of work and could require that lemmatizer.py treat English differently from other languages.
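For clarity, a minimal sketch of the dictionary-first approach floated above: consult an exceptions/dictionary table before falling back to suffix rules. The table entries and rules here are hypothetical examples, not spaCy's actual data:

```python
# Hedged sketch of a dictionary-first lemmatizer: look the word up in a
# hand-reviewed table first, and only fall back to suffix rules for
# out-of-vocabulary words. Entries and rules below are illustrative.

EXCEPTIONS = {"geese": "goose", "dose": "dose"}  # hand-reviewed entries

SUFFIX_RULES = [("ing", ""), ("s", "")]  # toy rules, not spaCy's

def lemmatize(word):
    if word in EXCEPTIONS:              # dictionary lookup wins
        return EXCEPTIONS[word]
    for strip, add in SUFFIX_RULES:     # rule fallback for OOV words
        if word.endswith(strip):
            return word[: -len(strip)] + add
    return word

print(lemmatize("dose"))   # -> dose (dictionary prevents "dos")
print(lemmatize("walks"))  # -> walk (rule fallback)
```

This is where the English-specific concern comes in: a large curated exceptions table is an English-only resource, so lemmatizer.py would end up treating English differently from languages that only have rules.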
Let me know your thoughts on this.
The text was updated successfully, but these errors were encountered: