Fix inconsistent lemmatizer issue #3484 #3646

Merged 2 commits on May 4, 2019
5 changes: 3 additions & 2 deletions spacy/lemmatizer.py
@@ -1,5 +1,6 @@
# coding: utf8
from __future__ import unicode_literals
+from collections import OrderedDict

from .symbols import POS, NOUN, VERB, ADJ, PUNCT, PROPN
from .symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
@@ -118,8 +119,8 @@ def lemmatize(string, index, exceptions, rules):
forms.append(form)
else:
oov_forms.append(form)
-# Remove duplicates, and sort forms generated by rules alphabetically.
-forms = list(set(forms))
+# Remove duplicates but preserve the ordering of applied "rules"
+forms = list(OrderedDict.fromkeys(forms))
# Put exceptions at the front of the list, so they get priority.
# This is a dodgy heuristic -- but it's the best we can do until we get
# frequencies on this. We can at least prune out problematic exceptions,
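The effect of the change can be seen in isolation with a minimal sketch. The helper names and the candidate list below are hypothetical and not taken from spaCy; only the two deduplication expressions come from the diff. Note that `list(set(forms))` does not sort: iteration order of a set of strings is arbitrary and can differ between interpreter runs when hash randomization is active, which is one plausible source of the inconsistency this PR addresses.

```python
from collections import OrderedDict


def dedupe_unordered(forms):
    """Old expression from the diff: drops duplicates, order is arbitrary."""
    return list(set(forms))


def dedupe_ordered(forms):
    """New expression from the diff: keeps the first occurrence of each
    form, in the order the rules produced them."""
    return list(OrderedDict.fromkeys(forms))


# Hypothetical candidate forms, listed in the order rules generated them.
candidates = ["leaves", "leave", "leaf", "leave"]

ordered = dedupe_ordered(candidates)
unordered = dedupe_unordered(candidates)

print(ordered)            # ['leaves', 'leave', 'leaf']
print(sorted(unordered))  # same elements; the unsorted order may vary by run
```

With the ordered version, whichever rule fires first determines the first candidate, so the result is the same on every run.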