stop words missing for en_core_web_md #922

Closed
geoHeil opened this issue Mar 25, 2017 · 10 comments
Labels: models (Issues related to the statistical models)

Comments

@geoHeil

geoHeil commented Mar 25, 2017

I'm new to spaCy and want to configure stop words.
The regular spacy.en.STOP_WORDS do not seem to apply when loading the larger en_core_web_md model. How can I configure the larger model to use the regular stop words?

@honnibal
Member

This sounds like a bug in the model, thanks.

The general-purpose answer is that flags like IS_STOP are computed per type, so they're cached in the lexicon. You can add your own lexical flags or change how they're computed with the nlp.vocab.add_flag() method. You give this a function to compute the values and the flag ID, like this:

from spacy.attrs import IS_STOP
# my_stop_words is your own collection of stop word strings
nlp.vocab.add_flag(lambda string: string in my_stop_words, IS_STOP)

This should be a good workaround for you until the model is updated.
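
To illustrate the per-type caching described above, here is a minimal pure-Python sketch. `ToyVocab`, `ToyLexeme`, and the string flag name `"IS_STOP"` are hypothetical stand-ins, not spaCy's implementation; the only idea taken from the comment is that a flag is computed once per distinct string and then read from the cache.

```python
class ToyLexeme:
    """Hypothetical stand-in for a lexicon entry: one per distinct string."""

    def __init__(self, text):
        self.text = text
        self.flags = {}  # flag name -> cached boolean


class ToyVocab:
    """Hypothetical toy lexicon; not spaCy's actual Vocab."""

    def __init__(self):
        self.getters = {}  # flag name -> function(str) -> bool
        self.lexemes = {}  # string -> ToyLexeme

    def add_flag(self, getter, flag_name):
        # Register the getter and recompute the flag for entries already cached.
        self.getters[flag_name] = getter
        for lex in self.lexemes.values():
            lex.flags[flag_name] = getter(lex.text)

    def __getitem__(self, text):
        # Compute flags only the first time a string is seen.
        if text not in self.lexemes:
            lex = ToyLexeme(text)
            for name, getter in self.getters.items():
                lex.flags[name] = getter(text)
            self.lexemes[text] = lex
        return self.lexemes[text]


calls = []  # records every string the getter is asked about


def is_stop(s):
    calls.append(s)
    return s in {"the", "a"}


vocab = ToyVocab()
_ = vocab["the"]                    # entry is cached before the flag exists
vocab.add_flag(is_stop, "IS_STOP")  # flag is computed for the cached entry
first = vocab["the"].flags["IS_STOP"]
second = vocab["the"].flags["IS_STOP"]  # served from the cache, no new call
```

In this toy model the getter runs once for "the", no matter how often the entry is looked up afterwards. That is why a flag that was computed incorrectly when a model was built stays wrong until it is recomputed.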

@honnibal added the models label Mar 25, 2017
@honnibal
Member

Btw could you run:

python -m spacy info --markdown
python -m spacy info en_core_web_md --markdown

And paste the results here?

Thanks,
Matt

@geoHeil
Author

geoHeil commented Mar 26, 2017

Info about spaCy

  • spaCy version: 1.7.2
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en, en_core_web_md

and

Info about model en_core_web_md

  • lang: en
  • name: core_web_md
  • license: CC BY-SA 3.0
  • author: Explosion AI
  • url: https://explosion.ai
  • version: 1.2.1
  • spacy_version: >=1.7.0,<2.0.0
  • email: [email protected]
  • description: General-purpose English model, with tagging, parsing, entities and word vectors
  • source: /Users/geoheil/anaconda3/lib/python3.6/site-packages/en_core_web_md/en_core_web_md-1.2.1

@sadovnychyi
Contributor

Same here. The correct workaround is:

nlp.vocab.add_flag(lambda s: s in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)

(function first, ID later).

@pavlin99th
Contributor

To include lowercase, uppercase, and title-cased words (him/HIM/Him), I had to use:

nlp.vocab.add_flag(lambda s: s.lower() in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)
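
The difference between the case-sensitive and case-insensitive predicates can be checked without loading a model. This is a minimal sketch that uses a small hypothetical set, `SAMPLE_STOP_WORDS`, in place of spaCy's much larger stop word list; the function names are illustrative only.

```python
# Hypothetical sample; spaCy's real STOP_WORDS set is much larger.
SAMPLE_STOP_WORDS = {"him", "the", "and"}


def exact_match(s):
    # Case-sensitive: only the form stored in the set matches.
    return s in SAMPLE_STOP_WORDS


def case_insensitive_match(s):
    # Lowercase first, so "HIM" and "Him" match as well as "him".
    return s.lower() in SAMPLE_STOP_WORDS


for word in ("him", "HIM", "Him", "hymn"):
    print(word, exact_match(word), case_insensitive_match(word))
```

Either function could be passed as the first argument to add_flag; the second is the one that behaves like the lambda above.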

@ines
Member

ines commented Nov 9, 2017

The new en_core_web_md model for v2.0 is now available and the problem should be fixed in the new version: https://spacy.io/models/en#en_core_web_md 🎉

@ines closed this as completed Nov 9, 2017
@jmidyet

jmidyet commented Dec 24, 2017

@ines, I'm using en_core_web_md v2.0.0 and this continues to be an issue. It works fine with the small model.

@georgek

georgek commented Jan 24, 2018

Also having this problem with en_core_web_md v 2.0.0. I had to use the following as a workaround:

nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)
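
One way to check whether a loaded model is affected, or whether the workaround above took effect, is to list the stop words whose lexicon entry is not flagged. The helper below is a sketch: `missing_stop_flags` is a hypothetical name, and it assumes only that `vocab[word]` returns an object with a boolean `is_stop` attribute, as spaCy's Vocab and Lexeme do.

```python
def missing_stop_flags(vocab, stop_words):
    """Return, sorted, the stop words whose vocab entry is not flagged.

    Assumes vocab[word] returns an object with a boolean is_stop
    attribute (true of spaCy's Vocab/Lexeme; any similar object works).
    """
    return sorted(w for w in stop_words if not vocab[w].is_stop)
```

With a loaded pipeline this might be called as missing_stop_flags(nlp.vocab, spacy.lang.en.stop_words.STOP_WORDS). An empty list would suggest the flags are set, while a long list would match the behaviour reported in this thread.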

@fmfn

fmfn commented Mar 13, 2018

Same problem, but with en_core_web_lg v 2.0.0. @georgek's suggested workaround did the trick.

@lock

lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock bot locked as resolved and limited conversation to collaborators May 7, 2018