
Spacy tokenizer hangs #4362

Closed
levon003 opened this issue Oct 2, 2019 · 3 comments · Fixed by #4374

Labels: feat / tokenizer (Feature: Tokenizer), perf / speed (Performance: speed)

levon003 commented Oct 2, 2019

How to reproduce the behaviour

I identified a text that causes the tokenizer to [apparently] hang. See the sample script below.

text = "https://si0.twimg.com/profile_images/2711056064/4399ea260e5608718ba4f54960b51627.jpeg?awesome=3dhiQfF57zkZOHOuyKnORdmiqZSVDU08swJ1KmtpbNdWZbTKt1anNwJJB99a9stVShtj59XpA3yLxYRigamlXlxfwDWZ2LH2xKOqs2mHN46qgLBqb1H156JGhBryicZb1gmlKuC2vLMonWOCCA8ngGOSMwlSQai5LBNB4ZVAVrVPz2JoYwxUNDmtdGdGEgcdHuFFrHk7☃" + "☃" * 13200
print(text)
import spacy
nlp = spacy.load("en_core_web_lg")
tokenizer = nlp.Defaults.create_tokenizer(nlp)
print("loaded")
print(spacy.__version__)
# This loop appears to hang (or at least blocks for many hours) on the text above
for doc in tokenizer.pipe((text,)):
    print(doc)

I'm not sure whether the tokenizer ever terminates; based on a batch job that was given about 48 hours of processing time and hung, I'm fairly certain this text causes the tokenizer to block for at least 35 hours. (The test code above blocks for more than 5 minutes; in contrast, the string of 13200 snowman emoji '☃' on its own tokenizes in about 22 seconds, and the link without the emoji tokenizes rapidly, as expected.)

Note that neither the snowman emoji nor the link is sufficient on its own to cause the tokenizer to hang; it's something about the combination of the link and the long string of emoji.
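
A rough way to check those pieces individually (an untested sketch that reuses text and tokenizer from the script above; exact timings will of course vary by machine):

import timeit

# Snowmen only: terminates, reportedly in about 22 seconds
print(timeit.timeit(lambda: list(tokenizer.pipe(("☃" * 13200,))), number=1))

# Link only (trailing snowmen stripped): tokenizes rapidly
link = text.rstrip("☃")
print(timeit.timeit(lambda: list(tokenizer.pipe((link,))), number=1))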

On interrupt:

  File "tokenizer.pyx", line 142, in pipe
  File "tokenizer.pyx", line 125, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 167, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 184, in spacy.tokenizer.Tokenizer._split_affixes

Possibly related to issues #2835 and #2744.

Note that this is naturally occurring text posted by a user in a social media environment; to keep the example small, I replaced the literal run of snowman emoji in the original text with the emoji repeated the same number of times.

Your Environment

  • spaCy version: 2.1.8
  • Platform: Linux-3.10.0-957.27.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.7.3
  • Models: en
svlandeg added the feat / tokenizer and perf / speed labels on Oct 2, 2019
adrianeboyd (Contributor) commented

token_match returns the correct result but is very, very slow, and since ☃ is a suffix (matched in LIST_ICONS), the tokenizer splits off each snowman one by one. That takes a very long time because token_match is re-checked on every loop iteration after an affix is removed. It will terminate, but not for a while.

Temporary solution for your use case: customize the tokenizer suffix re to exclude LIST_ICONS. Not a perfect solution, since then this text will be one long token instead of a lot of individual snowman tokens.
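
A rough, untested sketch of that workaround against spaCy 2.1 (assuming the stock defaults, where the suffix patterns are exposed as nlp.Defaults.suffixes and LIST_ICONS comes from spacy.lang.char_classes):

import spacy
from spacy.util import compile_suffix_regex
from spacy.lang.char_classes import LIST_ICONS

nlp = spacy.load("en_core_web_lg")

# Rebuild the suffix regex without the icon patterns, so ☃ is no longer split off as a suffix
suffixes = [s for s in nlp.Defaults.suffixes if s not in LIST_ICONS]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search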

Better solution: improve the token_match re so it's not so slow.

honnibal closed this as completed on Oct 3, 2019
adrianeboyd (Contributor) commented

The lookbehind in the URL_PATTERN is unacceptably slow here. Each token_match check takes almost 15 seconds on my laptop. I think the prefix and suffix detection can be handled better in the tokenizer.
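
With roughly 13,200 trailing snowmen to strip one by one, ~15 seconds per check on the full string puts the total on the order of 50 hours of token_match calls alone, which is consistent with the 35+ hour hang reported above. A rough way to time a single check (a sketch assuming the spaCy 2.1 layout, where URL_PATTERN is exposed in spacy.lang.tokenizer_exceptions; the shortened stand-in URL below is made up, so timings will differ from the original report):

import re
import timeit
from spacy.lang.tokenizer_exceptions import URL_PATTERN

# A URL followed by a long run of snowmen, standing in for the reported text
sample = "https://si0.twimg.com/profile_images/2711056064/sample.jpeg?x=1" + "☃" * 13200
token_match = re.compile(URL_PATTERN).match
print(timeit.timeit(lambda: token_match(sample), number=1))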

lock bot commented Nov 4, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this issue as resolved and limited the conversation to collaborators on Nov 4, 2019