Inconsistent lemmatization across different python sessions #3484
Comments
I noticed the same thing while creating the In addition to the inconsistency, lemmas are often wrong. For instance...
When spaCy produces the incorrect lemma, pyinflect gives the wrong inflection back to the user. I'm interested in seeing this get fixed, although I think there's more to do than just sort the list of forms. Ideally we'd fix some of the underlying issues, either with exceptions or a change to the heuristics. Is there a plan to make changes? I'm willing to do this or at least help but I'd like to know if you already have plans so I don't take a different path than you want or duplicate an ongoing effort. |
As a test, I ran the portion of the Gutenberg corpus that's in NLTK (2.6M words) through spaCy and recorded the set of words that produced more than one lemma form (via a small hack in spaCy's code). The file spacy_multiple_forms.txt has about 1200 entries where multiple lemmas were present. Just reviewing the list, it's usually fairly obvious which form is correct (and it's often not form[0]). Lots of the words are spelling errors, but even for most of those it's obvious which one to choose. I haven't looked closely enough yet to see if there's a simple update to the rules that we can make, but at a minimum we could hand-select the correct answers for the dictionary words and put them in an exceptions file.
For Python 3.7+, we can use dict keys as a proxy for ordered sets, since dicts are guaranteed to preserve insertion order from that version on. This is even advised by a Python core developer here.
But of course that is too new and not supported in older versions. There are a number of packages out there that try to mimic an ordered set, but judging by recent development, spaCy is trying to get rid of as many third-party dependencies as possible. Therefore, the best solution might be to use OrderedDict's keys for older versions and a regular dict's keys for 3.7+.
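A minimal sketch of that idea, using only the standard library. The function names are hypothetical and this is not spaCy's code. It only removes duplicates while keeping first-seen order, so the result is deterministic only if the forms are generated in a deterministic order to begin with.

```python
from collections import OrderedDict


def unique_in_order(forms):
    """Drop duplicates while keeping first-seen order (Python 3.7+)."""
    # Regular dicts are guaranteed to preserve insertion order from 3.7 on.
    return list(dict.fromkeys(forms))


def unique_in_order_legacy(forms):
    """Same behavior for older Python versions, via OrderedDict."""
    return list(OrderedDict.fromkeys(forms))


# Hypothetical candidate forms, with one duplicate.
print(unique_in_order(["dose", "dos", "dose"]))         # ['dose', 'dos']
print(unique_in_order_legacy(["dose", "dos", "dose"]))  # ['dose', 'dos']
```

Unlike a plain set, neither variant depends on string hashing, so the order is the same in every session.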
It sounds like you're suggesting the applied rules should have an order of preference. Is this the case? I haven't seen this myself. If not, I think just sorting the few cases where there is more than one form at the end is probably the simplest. There are usually only 2 or 3 on the list to sort, so it would be reasonably quick. Just experimenting with this for a few minutes, I think there are some simple rules we could apply that would help select the correct form when multiple forms are returned from the
The above significantly reduces the incorrect forms from the list of ambiguous forms posted above. It's probably not the correct solution (use regular expressions, add more rules, etc.) but with a little thinking and experimenting we could probably come up with a reasonable set of heuristics. Keep in mind, this is really only an issue with OOV words and words that happen to have multiple spellings in the dictionary. Removing some of the oddball alternate spellings from the word list would also help.
This is all very interesting. In my opinion, the first priority is just to make the behavior deterministic, even if it deterministically chooses the wrong thing sometimes. This won't be ideal, but it won't be any worse than the current behavior, and it will make it easier to develop with spaCy. (I originally found this bug because I have some scripts that I need to be deterministic, and they use the lemmatizer.) This could be done as a one-line change to use

Then, there is the issue of selecting the right form. I don't know how this is done in other lemmatizers, and wouldn't want to reinvent the wheel. bjascob, it sounds like you know more than I do about this topic, so I'll defer to you on how exactly to do this part. (Also, I wonder if there's anything like this already in development? Or if spaCy has a solution for some languages, just not English? I'd recommend looking into these questions first.)
Unfortunately there's no perfect way to select the correct form for words that are OOV when basic rules are applied (at least none that I'm aware of). The suggested heuristic is really a band-aid that would correct some errors, but you'd still be left with others. I agree the most important thing is to make the process deterministic, at least for now. If you're willing to do a PR for this, that would be great. A few notes though. I see two places where order can get messed up. Both are in
My suggestion would be, just to keep the logic simple, to simply return .. and after thinking about it a minute, since the forms only differ by a few characters at the end, sorting the list alphabetically would return the shortest form in most cases too. |
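A minimal sketch of the sorting approach, assuming the candidate forms arrive as a non-empty set of strings. The helper name pick_lemma is hypothetical and not part of spaCy.

```python
def pick_lemma(forms):
    """Return a deterministic choice from a non-empty collection of forms."""
    # sorted() yields the same order in every session. list(set(...)) does
    # not, because string hashing is randomized per Python process by default.
    return sorted(forms)[0]


# The two candidate forms reported in the issue.
candidates = {"dose", "dos"}
print(pick_lemma(candidates))  # dos
```

When one form is a prefix of the other, alphabetical order puts the shorter one first, which is why 'dos' wins here. The choice is stable across sessions, although not necessarily the correct lemma.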
How to reproduce the behaviour
The following will produce either ['dose'] or ['dos']. The output will be the same across multiple runs within the same Python session, but if you start a new Python session you may instead see the other of the two (which is then consistent within that session).

After some poking around, I think I've traced the issue to this line in the function lemmatize defined in spacy/lemmatizer.py. Based on the comment there, it seems the assumption is being made that calling list() with an argument of set type will produce a sorted list. As far as I understand, this is not true (at least in Python 3.6.4); instead, the returned list has an arbitrary order.

Further up the call stack, the list reaches this line in Morphology.lemmatize, which selects the first element of the returned list. This seems sufficient to explain the behavior.

Your Environment