
Inconsistent lemmatization across different python sessions #3484

Closed
rfriel opened this issue Mar 26, 2019 · 7 comments · Fixed by #3646
Labels
bug (Bugs and behaviour differing from documentation) · feat / lemmatizer (Feature: Rule-based and lookup lemmatization) · reproducibility (Consistency, reproducibility, determinism, and randomness)

Comments


rfriel commented Mar 26, 2019

How to reproduce the behaviour

The following will either produce the output ['dose'] or ['dos']. The output will be the same in multiple runs during the same python session, but if you start a new python session you may instead see the other of the two (which is then consistent within that session).

import spacy
nlp = spacy.load("en")

print([tok.lemma_ for tok in nlp('doses')])

After some poking around, I think I've traced the issue to this line in the function lemmatize defined in spacy/lemmatizer.py. Based on the comment there, the assumption seems to be that calling list() on a set will produce a sorted list. As far as I understand, this is not true (at least in Python 3.6.4); the returned list instead has an arbitrary order.

Further up the call stack, the list reaches this line in Morphology.lemmatize, which selects the first element of the returned list. This seems sufficient to explain the behavior.
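
As a quick illustration (my own minimal example, not spaCy's code): string hashing is randomized per interpreter session by default, so the iteration order of a set of strings is stable within a session but can change between sessions, which matches the behaviour above.

    forms = {'dose', 'dos'}      # set iteration order depends on the session's hash seed
    print(list(forms))           # ['dose', 'dos'] in one session, ['dos', 'dose'] in another
    print(sorted(forms))         # always ['dos', 'dose']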

Your Environment

  • spaCy version: 2.1.3
  • Platform: Darwin-18.2.0-x86_64-i386-64bit
  • Python version: 3.6.4
  • Models: en
ines added the bug and feat / lemmatizer labels on Mar 26, 2019

bjascob commented Apr 13, 2019

I noticed the same thing while creating the pyinflect extension. The inconsistency is particularly problematic because it means I can't create exceptions for it.

In addition to the inconsistency, lemmas are often wrong. For instance...

spared -> [spare, spar]
hating -> [hate, hat]

When spaCy produces the incorrect lemma, pyinflect gives the wrong inflection back to the user.

I'm interested in seeing this get fixed, although I think there's more to do than just sort the list of forms. Ideally we'd fix some of the underlying issues, either with exceptions or a change to the heuristics. Is there a plan to make changes? I'm willing to do this, or at least help, but I'd like to know whether you already have plans so I don't take a different path than you want or duplicate an ongoing effort.


bjascob commented Apr 14, 2019

For a test I ran the portion of the Gutenberg corpus that's in NLTK (2.6M words) through spaCy and recorded the set of words that produced more than one form of lemma (via a small hack in spaCy's code). This file spacy_multiple_forms.txt has about 1200 entries where multiple lemmas were present.

Just reviewing the list, it's usually fairly obvious which form is correct (and it's often not form[0]). Lots of the words are spelling errors, but even for most of those it's obvious which one to choose. I haven't looked closely enough yet to see if there's a simple update to the rules we could make, but at a minimum we could hand-select the correct answers for the dictionary words and put them in an exceptions file.

@BramVanroy

For Python 3.7+, we can use dict keys as a proxy for an ordered set, since dicts now guarantee insertion order. This is even advised by a Python core developer here.

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

But of course that is too new and not supported in older versions. There are a number of packages out there that try to mimic an ordered set, but judging by recent development, spaCy is trying to get rid of as many third-party dependencies as possible. Therefore, the best solution might be OrderedDict keys for older versions and regular dict keys for 3.7+.
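
A minimal sketch of what that could look like (the helper name dedupe_preserving_order is just for illustration, not existing spaCy code):

    from collections import OrderedDict

    def dedupe_preserving_order(forms):
        # OrderedDict.fromkeys keeps the first occurrence of each form, in
        # insertion order, on any supported Python version; on 3.7+ plain
        # dict.fromkeys behaves the same way.
        return list(OrderedDict.fromkeys(forms))

    dedupe_preserving_order(['dose', 'dos', 'dose'])   # ['dose', 'dos']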


bjascob commented Apr 17, 2019

It sounds like you're suggesting the applied rules should have an order of preference. Is this the case? I haven't seen this myself. If not, I think just sorting, at the end, the few cases where there is more than one form is probably the simplest approach. There are usually only 2 or 3 forms on the list to sort, so it would be reasonably quick.

Just experimenting with this for a few minutes, I think there are some simple rules we could apply that would help select the correct form when multiples are returned from the lemmatize method. For instance, wordnet._morphy uses min(sorted(infl.forms), key=len) as a heuristic. I didn't see spaCy using this. Additionally, we could apply a few simple checks to pick out common word forms. For instance...

    choice = min(sorted(infl.forms), key=len)
    for form in infl.forms:
        # A common pattern is __[aiou]_e but not __ye and not __ione
        if len(form) > 3 and form[-1] == 'e' and form[-3] in 'aiou' and form[-2] != 'y' and form[-4:-1] != 'ion':
            choice = form
        # A common pattern is '__nce'
        if len(form) > 3 and form[-1] == 'e' and form[-3:-1] == 'nc':
            choice = form
    return choice

The above significantly reduces the incorrect forms from the list of ambiguous forms posted above. It's probably not the final solution (we'd want regular expressions, more rules, etc.; a rough sketch is below), but with a little thinking and experimenting we could probably come up with a reasonable set of heuristics.
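
For example, the same two patterns could be restated with regular expressions along these lines (my own sketch; choose_form and the plain list input are hypothetical, not spaCy code):

    import re

    def choose_form(forms):
        # Default heuristic: prefer the shortest form, as NLTK / morphy does.
        choice = min(sorted(forms), key=len)
        for form in forms:
            # Ends in [aiou]_e but not _ye and not _ione, e.g. 'hate', 'spare'
            if len(form) > 3 and re.search(r'[aiou][^y]e$', form) and not form.endswith('ione'):
                choice = form
            # Ends in 'nce', e.g. 'dance'
            if len(form) > 3 and form.endswith('nce'):
                choice = form
        return choice

    choose_form(['hat', 'hate'])    # 'hate'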

Keep in mind, this is really only an issue with OOV words and words that happen to have multiple spellings in the dictionary. Removing some of the oddball alternate spellings from the word list would also help.


rfriel commented Apr 17, 2019

This is all very interesting. In my opinion, the first priority is just to make the behavior deterministic, even if it deterministically chooses the wrong thing sometimes. This won't be ideal, but it won't be any worse than the current behavior, and it will make it easier to develop with spaCy. (I originally found this bug because I have some scripts that I need to be deterministic, and they use the lemmatizer.)

This could be done as a one-line change to use OrderedDict instead of set for deduplication, as BramVanroy suggests. I could write this PR unless someone else wants to.

Then, there is the issue of selecting the right form. I don't know how this is done in other lemmatizers, and wouldn't want to reinvent the wheel. bjascob, it sounds like you know more than I do about this topic, so I'll defer to you on how exactly to do this part. (Also, I wonder if there's anything like this already in development? Or if spacy has a solution for some languages, just not English? I'd recommend looking into these questions first.)


bjascob commented Apr 17, 2019

Unfortunately there's no perfect way to select the correct form for words that are OOV when basic rules are applied (at least that I'm aware of). The suggested heuristic is really a band-aid that would correct some errors but you'd still be left with others.

I agree the most important thing is to make the process deterministic, at least for now. If you're willing to do a PR for this that would be great. A few notes though..

I see two places where order can get messed up. Both are in lemmatizer.py::lemmatize(string,..)

  1. The parameter rules is a standard dictionary and has no guaranteed ordering (OrderedDict would fix this)
  2. A few lines below, after converting forms to a set and then back to a list, any order is lost. That line could probably be eliminated by simply checking if form not in forms, and the other line with if form not in oov_forms

My suggestion, just to keep the logic simple, would be to simply return sorted(forms, key=len). spaCy's lemmatizer is basically the same as the NLTK / wordnet.morphy lemmatizer. I looked at the NLTK code and when they have multiple forms to deal with they return the shortest one. This could be achieved by having the lemmatizer return sorted(forms, key=len), so the list is sorted by length instead of alphabetically. Given that NLTK does it this way, that seems like a good approach to me.

...and after thinking about it for a minute: since the forms only differ by a few characters at the end, sorting the list alphabetically would return the shortest form in most cases too.
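
As a quick check of that last point (my own example, not spaCy code):

    forms = {'dos', 'dose'}            # set order is arbitrary
    print(sorted(forms, key=len))      # ['dos', 'dose'] -- deterministic, shortest first
    print(sorted(forms))               # ['dos', 'dose'] -- plain alphabetical sort agrees here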

ines closed this as completed in #3646 on May 4, 2019
ines pushed a commit that referenced this issue on May 4, 2019
* Fix inconsistant lemmatizer issue #3484

* Remove test case

lock bot commented Jun 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Jun 3, 2019
polm added the reproducibility label on Nov 22, 2022