
Incorrect lemmatization of some words #3665

Closed
bjascob opened this issue May 1, 2019 · 5 comments
Labels: feat / lemmatizer (Feature: Rule-based and lookup lemmatization), lang / en (English language data and models), perf / accuracy (Performance: accuracy)

Comments

bjascob (Contributor) commented May 1, 2019

Issue

For some words, spaCy doesn't produce the correct lemma. Using an automated method, I found about 400 incorrect lemma forms; see mismatches.txt. This is a list of potential issues that would need to be reviewed by hand before being added to the exception lists.

Test Technique

I had spaCy parse the Gutenberg corpus (as distributed with NLTK) using en_core_web_sm, and then tested each produced lemma against a lookup table I created from the NIH SPECIALIST Lexicon. The table maps words to a list of potential base forms (it's basically a dictionary-based lemmatizer instead of a rule-based one). I didn't look at proper nouns or other forms that spaCy doesn't inflect (except adverbs, where 28 were left unhandled by spaCy).
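For illustration, a minimal sketch of the comparison might look like the code below. BASE_FORMS here is a tiny hypothetical stand-in for the SPECIALIST-derived lookup table, and the chunking is just a rough way to stay under spaCy's default max_length; the real test harness was more involved.

```python
import spacy
from nltk.corpus import gutenberg  # requires nltk.download('gutenberg')

# Hypothetical stand-in for the full table built from the SPECIALIST
# Lexicon: maps (word, coarse POS) to the set of acceptable base forms.
BASE_FORMS = {
    ("mice", "NOUN"): {"mouse"},
    ("saw", "VERB"): {"see", "saw"},
}

nlp = spacy.load("en_core_web_sm")
mismatches = set()

for fileid in gutenberg.fileids():
    text = gutenberg.raw(fileid)
    # Crude chunking so long files stay under nlp.max_length.
    for start in range(0, len(text), 100_000):
        doc = nlp(text[start:start + 100_000])
        for token in doc:
            # Skip proper nouns and other forms spaCy doesn't inflect.
            if token.pos_ not in ("VERB", "NOUN", "ADJ", "ADV"):
                continue
            accepted = BASE_FORMS.get((token.lower_, token.pos_))
            if accepted and token.lemma_.lower() not in accepted:
                mismatches.add((token.lower_, token.pos_, token.lemma_))

for word, pos, lemma in sorted(mismatches):
    print(f"{word}/{pos}: spaCy produced '{lemma}', "
          f"expected one of {BASE_FORMS[(word, pos)]}")
```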

The test could easily be run with another model (which might give slightly different tagging in some cases) or a different corpus (which could include more words or possibly more "modern" English). If you have opinions on either, let me know.

Environment

Proposed Fix

It's a fair bit of work to hand-edit the above list and add it to the code, so I wanted to check with the experts and get approval/opinions before going through with it. In addition, I'd like to verify that the inconsistencies PR referenced above will be included in the next release, and that you aren't planning any big changes to the lemmatizer. Both of those could impact the test results and the list of changes.

Here's what I propose.

  • For verb, noun, adj, and adv, review and add the valid exceptions to _verbs_irreg.py, etc.
  • For adv, additionally add a branch to the if/else logic in Lemmatizer.__call__ so the exceptions list is applied. Is there a reason adverbs are excluded today? Note that I need to review and test this a bit more (specifically the call to is_base_form) to make sure there are no adverse consequences of adding 'adv'.
  • TBD: Consider whether the _adverbs_irreg.py list should be programmatically concatenated onto the _adjectives_irreg.py list in __init__. Words like 'farther' and 'farthest' can be either ADV or ADJ. Is this always true for ADVs, or just a few? I can either add the entire list or just replicate the few instances I see in the corpus testing. A rough sketch of the last two points follows this list.
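Here's a self-contained toy sketch of what those last two points would mean in practice. The table entries are illustrative (the real ones live in _adverbs_irreg.py and _adjectives_irreg.py), and the lookup function mirrors only the exception-first step of Lemmatizer.__call__, not spaCy's rule machinery or the is_base_form check.

```python
# Toy illustration of the proposal, not spaCy source.
ADVERBS_IRREG = {"best": ("well",), "farther": ("far",), "farthest": ("far",)}
ADJECTIVES_IRREG = {"better": ("good",), "worse": ("bad",)}

LEMMA_EXC = {
    "adj": {**ADJECTIVES_IRREG, **ADVERBS_IRREG},  # proposed concatenation
    "adv": ADVERBS_IRREG,                          # proposed new 'adv' entry
}

def lemmatize(string, univ_pos):
    # Mirrors only the exception-first lookup; rule application is omitted.
    exc = LEMMA_EXC.get(univ_pos, {})
    if string in exc:
        return list(exc[string])
    return [string]

print(lemmatize("farther", "adv"))  # ['far']  - via the new 'adv' branch
print(lemmatize("farther", "adj"))  # ['far']  - via the merged adj table
print(lemmatize("fast", "adv"))     # ['fast'] - falls through unchanged
```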

Alternatively, instead of patching the holes, we could consider upgrading to a more extensive set of rules and/or moving to a dictionary-based approach. However, either of these would require a fair bit of work and could require that lemmatizer.py handle English differently from other languages.

Let me know your thoughts on this.

@honnibal added the feat / lemmatizer and perf / accuracy labels on May 11, 2019
honnibal (Member) commented:

Thanks for the work on this. You should probably hold off on making the corrections until at least v2.1.4 is released, as I did fix a bug in the lemmatizer. It might be affecting some of these results.

The other thing happening with the lemmatizer is that we're switching over to supporting richer morphological features, which will allow us to write much better rules. This should improve accuracy significantly.

bjascob (Contributor, Author) commented May 11, 2019

If you're going to make lemmatizer changes, you might consider looking at LemmInflect (my code). It has a small character-based neural-net classifier that looks at the word and selects one of 34 automatically generated lemmatization rules. Since it's a neural net, it works well on OOV words as well as dictionary ones. The net is small enough that it takes up hardly any CPU time, and it's implemented with numpy, so no additional third-party libraries are needed.
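Basic usage looks like this (a short sketch based on LemmInflect's documented API; exact outputs can vary by version):

```python
from lemminflect import getLemma, getAllLemmas

print(getLemma('watches', upos='VERB'))  # ('watch',)
print(getAllLemmas('watches'))           # candidate lemmas keyed by coarse POS

# Importing lemminflect also registers a ._.lemma() extension on spaCy
# tokens, so it can be used as a drop-in replacement for the lemmatizer.
import spacy
import lemminflect

nlp = spacy.load('en_core_web_sm')
doc = nlp('I am testing this example.')
print(doc[2]._.lemma())                  # 'test'
```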

The module also has code for parsing and using the NIH's SPECIALIST Lexicon, which is a great resource for morphology info on English words. It contains about 500K words and appears to be very accurate. Check it out if you need corpus resources for this.

I'd be happy to contribute if you get to a point where you're interested.

honnibal (Member) commented Jun 1, 2019

Really nice module, thanks!

ines (Member) commented Jun 1, 2019

Merging this with the master issue in #2668!

@ines closed this as completed and added the lang / en (English language data and models) label on Jun 1, 2019
lock bot commented Jul 1, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this as resolved and limited conversation to collaborators on Jul 1, 2019