Help with lemmatization, different results #3644

Closed
userFT opened this issue Apr 26, 2019 · 6 comments

Comments

userFT commented Apr 26, 2019

I'm currently using spaCy with Python. The model is en_core_web_sm (2.1.0).

I run the following code to get a list of "cleansed" words from a query:

import spacy

nlp = spacy.load("en_core_web_sm")
query = "processing of tea leaves"  # example query
doc = nlp(query)
list_words = []
for token in doc:
    # skip tokens that consist of a single space
    if token.text != ' ':
        list_words.append(token.lemma_)

However, I face a major issue when running this code. For example, when the query is "processing of tea leaves", the result stored in list_words can be either ['processing', 'tea', 'leaf'] or ['processing', 'tea', 'leave'].

The result is not consistent. I cannot change my input/query (adding another word for context is not possible) and I really need to get the same result every time. I think the loading of the model may be the issue.

Why do the results differ? Can I load the model the "same" way every time? Did I miss a parameter that gives the same result for an ambiguous query?

Thanks for your help

Contributor

DuyguA commented Apr 26, 2019

Can you check the POS tags for these input sentences? Are the sentences correctly tagged?
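A check like this could be sketched as follows. It lists the text, fine-grained tag and lemma of each token side by side, so two runs can be compared directly. The helper name `describe_tokens` and the stand-in tokens are assumptions for illustration; only the attributes `text`, `tag_` and `lemma_` come from this thread.

```python
from types import SimpleNamespace


def describe_tokens(doc):
    """Return (text, tag, lemma) for each non-whitespace token."""
    return [(t.text, t.tag_, t.lemma_) for t in doc if not t.text.isspace()]


# Stand-in tokens so the sketch runs without a model; values are illustrative.
fake_doc = [
    SimpleNamespace(text="tea", tag_="NN", lemma_="tea"),
    SimpleNamespace(text=" ", tag_="_SP", lemma_=" "),
    SimpleNamespace(text="leaves", tag_="NNS", lemma_="leaf"),
]

for row in describe_tokens(fake_doc):
    print(row)
```

With a loaded pipeline, `describe_tokens(nlp(query))` should work the same way, since it only reads those three attributes from each token.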

Author

userFT commented Apr 26, 2019

Hi @DuyguA, thank you very much for your answer.
In both cases, for "processing of tea leaves", I got the following POS tags: ['NN', 'NN', 'NNS'] (using token.tag_).
['processing', 'tea', 'leaf'] => ['NN', 'NN', 'NNS']
['processing', 'tea', 'leave'] => ['NN', 'NN', 'NNS']

It seems that the sentence is correctly tagged. I'm fine with either of those two results; I just want to consistently "hit" the same one, either 'leaf' or 'leave' (not sure if I made myself clear).

@BramVanroy
Contributor

This seems to be the same as #3484 and is fixed in PR #3646.
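This thread does not say what caused the variation or what the PR changed. One general Python pitfall that produces this symptom (same tags, different lemma from one interpreter run to the next) is taking the first element of a set of strings, because set order depends on string hashing. The sketch below illustrates only that pitfall and is not taken from spaCy's code; `unique_in_order` and `candidates` are hypothetical names.

```python
from collections import OrderedDict


def unique_in_order(forms):
    """Drop duplicates while keeping first-seen order."""
    return list(OrderedDict.fromkeys(forms))


# Hypothetical candidate lemmas for "leaves"; not taken from spaCy's data.
candidates = ["leaf", "leave", "leaf"]

# list(set(...)) has no guaranteed order for strings: it can change between
# interpreter runs unless PYTHONHASHSEED is fixed.
unordered = list(set(candidates))

# Order-preserving de-duplication yields the same list on every run.
stable = unique_in_order(candidates)
print(stable)  # ['leaf', 'leave']
```

If a pipeline picks `unordered[0]`, the result may flip between 'leaf' and 'leave' across runs, whereas `stable[0]` is always the first candidate seen.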

Author

userFT commented Apr 29, 2019

Looks like it's working for me. Thanks a lot!
When will it be included in a release?

@BramVanroy
Contributor

If this completely solved your issue, please close this topic so that we can focus our attention on open issues.

userFT closed this as completed Apr 30, 2019

lock bot commented May 30, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 30, 2019