
Inconsistent lemmatization across different python sessions #3484

Closed
rfriel opened this issue Mar 26, 2019 · 7 comments · Fixed by #3646
Labels
bug (Bugs and behaviour differing from documentation) · feat / lemmatizer (Feature: Rule-based and lookup lemmatization) · reproducibility (Consistency, reproducibility, determinism, and randomness)

Comments


rfriel commented Mar 26, 2019

How to reproduce the behaviour

The following will either produce the output ['dose'] or ['dos']. The output will be the same in multiple runs during the same python session, but if you start a new python session you may instead see the other of the two (which is then consistent within that session).

import spacy
nlp = spacy.load("en")

print([tok.lemma_ for tok in nlp('doses')])

After some poking around, I think I've traced the issue to this line in the function lemmatize defined in spacy/lemmatizer.py. Based on the comment there, the assumption seems to be that calling list() on a set will produce a sorted list. As far as I understand, this is not true (at least in Python 3.6.4); the returned list instead has an arbitrary order.

Further up the call stack, the list reaches this line in Morphology.lemmatize, which selects the first element of the returned list. This seems sufficient to explain the behavior.
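
As a quick illustration (my own minimal example, not spaCy's code): string hashing is randomized per interpreter session by default, so the iteration order of a set of strings is stable within a session but can change between sessions, which matches the behaviour above.

    forms = {'dose', 'dos'}      # set iteration order depends on the session's hash seed
    print(list(forms))           # ['dose', 'dos'] in one session, ['dos', 'dose'] in another
    print(sorted(forms))         # always ['dos', 'dose']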

Your Environment

  • spaCy version: 2.1.3
  • Platform: Darwin-18.2.0-x86_64-i386-64bit
  • Python version: 3.6.4
  • Models: en
ines added the bug and feat / lemmatizer labels on Mar 26, 2019

bjascob commented Apr 13, 2019

I noticed the same thing while creating the pyinflect extension. The inconsistency is particularly problematic because it means I can't create exceptions for it.

In addition to the inconsistency, lemmas are often wrong. For instance...

spared -> [spare, spar]
hating -> [hate, hat]

When spaCy produces the incorrect lemma, pyinflect gives the wrong inflection back to the user.

I'm interested in seeing this get fixed, although I think there's more to do than just sort the list of forms. Ideally we'd fix some of the underlying issues, either with exceptions or a change to the heuristics. Is there a plan to make changes? I'm willing to do this, or at least help, but I'd like to know whether you already have plans so I don't take a different path than you want or duplicate an ongoing effort.


bjascob commented Apr 14, 2019

For a test I ran the portion of the Gutenberg corpus that's in NLTK (2.6M words) through spaCy and recorded the set of words that produced more than one form of lemma (via a small hack in spaCy's code). This file spacy_multiple_forms.txt has about 1200 entries where multiple lemmas were present.

Just reviewing the list, it's usually fairly obvious which form is correct (and it's often not form[0]). Lots of the words are spelling errors, but even for most of those it's obvious which one to choose. I haven't looked closely enough yet to see if there's a simple update to the rules we could make, but at a minimum we could hand-select the correct answers for the dictionary words and put them in an exceptions file.

@BramVanroy

For Python 3.7+, we can use dict keys as a proxy for an ordered set, since dicts now guarantee insertion order. This is even advised by a Python core developer here.

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

But of course that is too new and not supported in older versions. There are a number of packages out there that try to mimic an ordered set, but judging by recent development, spaCy is trying to get rid of as many third-party dependencies as possible. Therefore, the best solution might be OrderedDict keys for older versions and regular dict keys for 3.7+.
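
A minimal sketch of what that could look like (the helper name dedupe_preserving_order is just for illustration, not existing spaCy code):

    from collections import OrderedDict

    def dedupe_preserving_order(forms):
        # OrderedDict.fromkeys keeps the first occurrence of each form, in
        # insertion order, on any supported Python version; on 3.7+ plain
        # dict.fromkeys behaves the same way.
        return list(OrderedDict.fromkeys(forms))

    dedupe_preserving_order(['dose', 'dos', 'dose'])   # ['dose', 'dos']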


bjascob commented Apr 17, 2019

It sounds like you're suggesting the applied rules should have an order of preference. Is this the case? I haven't seen this myself. If not, I think just sorting, at the end, the few cases where there is more than one form is probably the simplest approach. There are usually only 2 or 3 forms on the list to sort, so it would be reasonably quick.

Just experimenting with this for a few minutes, I think there are some simple rules we could apply that would help select the correct form when multiples are returned from the lemmatize method. For instance, wordnet._morphy uses min(sorted(infl.forms), key=len) as a heuristic. I didn't see spaCy using this. Additionally, we could apply a few simple checks to pick out common word forms. For instance...

    choice = min(sorted(infl.forms), key=len)
    for form in infl.forms:
        # A common pattern is __[aiou]_e but not __ye and not __ione
        if len(form) > 3 and form[-1] == 'e' and form[-3] in 'aiou' and form[-2] != 'y' and form[-4:-1] != 'ion':
            choice = form
        # A common pattern is '__nce'
        if len(form) > 3 and form[-1] == 'e' and form[-3:-1] == 'nc':
            choice = form
    return choice

The above significantly reduces the incorrect forms from the list of ambiguous forms posted above. It's probably not the final solution (we'd want regular expressions, more rules, etc.; a rough sketch is below), but with a little thinking and experimenting we could probably come up with a reasonable set of heuristics.
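
For example, the same two patterns could be restated with regular expressions along these lines (my own sketch; choose_form and the plain list input are hypothetical, not spaCy code):

    import re

    def choose_form(forms):
        # Default heuristic: prefer the shortest form, as NLTK / morphy does.
        choice = min(sorted(forms), key=len)
        for form in forms:
            # Ends in [aiou]_e but not _ye and not _ione, e.g. 'hate', 'spare'
            if len(form) > 3 and re.search(r'[aiou][^y]e$', form) and not form.endswith('ione'):
                choice = form
            # Ends in 'nce', e.g. 'dance'
            if len(form) > 3 and form.endswith('nce'):
                choice = form
        return choice

    choose_form(['hat', 'hate'])    # 'hate'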

Keep in mind, this is really only an issue with OOV words and words that happen to have multiple spellings in the dictionary. Removing some of the oddball alternate spellings from the word list would also help.


rfriel commented Apr 17, 2019

This is all very interesting. In my opinion, the first priority is just to make the behavior deterministic, even if it deterministically chooses the wrong thing sometimes. This won't be ideal, but it won't be any worse than the current behavior, and it will make it easier to develop with spaCy. (I originally found this bug because I have some scripts that I need to be deterministic, and they use the lemmatizer.)

This could be done as a one-line change to use OrderedDict instead of set for deduplication, as BramVanroy suggests. I could write this PR unless someone else wants to.

Then, there is the issue of selecting the right form. I don't know how this is done in other lemmatizers, and wouldn't want to reinvent the wheel. bjascob, it sounds like you know more than I do about this topic, so I'll defer to you on how exactly to do this part. (Also, I wonder if there's anything like this already in development? Or if spacy has a solution for some languages, just not English? I'd recommend looking into these questions first.)


bjascob commented Apr 17, 2019

Unfortunately there's no perfect way to select the correct form for words that are OOV when basic rules are applied (at least that I'm aware of). The suggested heuristic is really a band-aid that would correct some errors but you'd still be left with others.

I agree the most important thing is to make the process deterministic, at least for now. If you're willing to do a PR for this that would be great. A few notes though..

I see two places where order can get messed up. Both are in lemmatizer.py::lemmatize(string,..)

  1. The parameter rules is a standard dictionary and has no guaranteed ordering (OrderedDict would fix this)
  2. A few lines below, after converting forms to a set and then back to a list, any order is lost. That line could probably be eliminated by simply checking if form not in forms, and the other line with if form not in oov_forms

My suggestion, just to keep the logic simple, would be to simply return sorted(forms, key=len). spaCy's lemmatizer is basically the same as the NLTK / wordnet.morphy lemmatizer. I looked at the NLTK code and when they have multiple forms to deal with they return the shortest one. This could be achieved by having the lemmatizer return sorted(forms, key=len), so the list is sorted by length instead of alphabetically. Given that NLTK does it this way, that seems like a good approach to me.

...and after thinking about it for a minute: since the forms only differ by a few characters at the end, sorting the list alphabetically would return the shortest form in most cases too.
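
As a quick check of that last point (my own example, not spaCy code):

    forms = {'dos', 'dose'}            # set order is arbitrary
    print(sorted(forms, key=len))      # ['dos', 'dose'] -- deterministic, shortest first
    print(sorted(forms))               # ['dos', 'dose'] -- plain alphabetical sort agrees here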

ines closed this as completed in #3646 on May 4, 2019
ines pushed a commit that referenced this issue on May 4, 2019
* Fix inconsistant lemmatizer issue #3484

* Remove test case

lock bot commented Jun 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Jun 3, 2019
polm added the reproducibility label on Nov 22, 2022