
Spacy tokenizer hangs #4362

Closed
levon003 opened this issue Oct 2, 2019 · 3 comments · Fixed by #4374

Labels: feat / tokenizer (Feature: Tokenizer), perf / speed (Performance: speed)

levon003 commented Oct 2, 2019

How to reproduce the behaviour

I identified a text that causes the tokenizer to [apparently] hang. See the sample script below.

text = "https://si0.twimg.com/profile_images/2711056064/4399ea260e5608718ba4f54960b51627.jpeg?awesome=3dhiQfF57zkZOHOuyKnORdmiqZSVDU08swJ1KmtpbNdWZbTKt1anNwJJB99a9stVShtj59XpA3yLxYRigamlXlxfwDWZ2LH2xKOqs2mHN46qgLBqb1H156JGhBryicZb1gmlKuC2vLMonWOCCA8ngGOSMwlSQai5LBNB4ZVAVrVPz2JoYwxUNDmtdGdGEgcdHuFFrHk7☃" + "☃" * 13200
print(text)
import spacy
nlp = spacy.load("en_core_web_lg")
tokenizer = nlp.Defaults.create_tokenizer(nlp)
print("loaded")
print(spacy.__version__)
# This loop appears to hang (or at least blocks for many hours) on the text above
for doc in tokenizer.pipe((text,)):
    print(doc)

I'm not sure whether the tokenizer ever terminates; based on a batch job that was given about 48 hours of processing time and hung, I'm fairly certain this text causes the tokenizer to block for at least 35 hours. (The test code above blocks for more than 5 minutes; in contrast, the string of 13200 snowman emoji '☃' on its own tokenizes in about 22 seconds, and the link without the emoji tokenizes rapidly, as expected.)

Note that neither the snowman emoji nor the link is sufficient on its own to cause the tokenizer to hang; it's something about the combination of the link and the long string of emoji.
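
A rough way to check those pieces individually (an untested sketch that reuses text and tokenizer from the script above; exact timings will of course vary by machine):

import timeit

# Snowmen only: terminates, reportedly in about 22 seconds
print(timeit.timeit(lambda: list(tokenizer.pipe(("☃" * 13200,))), number=1))

# Link only (trailing snowmen stripped): tokenizes rapidly
link = text.rstrip("☃")
print(timeit.timeit(lambda: list(tokenizer.pipe((link,))), number=1))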

On interrupt:

  File "tokenizer.pyx", line 142, in pipe
  File "tokenizer.pyx", line 125, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 167, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 184, in spacy.tokenizer.Tokenizer._split_affixes

Possibly related to issues #2835 and #2744.

Note that this is naturally occurring text posted by a user in a social media environment; to keep the example small, I replaced the literal run of snowman emoji in the original text with the emoji repeated the same number of times.

Your Environment

  • spaCy version: 2.1.8
  • Platform: Linux-3.10.0-957.27.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.7.3
  • Models: en
svlandeg added the feat / tokenizer and perf / speed labels on Oct 2, 2019
adrianeboyd (Contributor) commented

token_match returns the correct result but is very, very slow, and since ☃ is a suffix (matched in LIST_ICONS), the tokenizer splits off each snowman one by one. That takes a very long time because token_match is re-checked on every loop iteration after an affix is removed. It will terminate, but not for a while.

Temporary solution for your use case: customize the tokenizer suffix re to exclude LIST_ICONS. Not a perfect solution, since then this text will be one long token instead of a lot of individual snowman tokens.
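
A rough, untested sketch of that workaround against spaCy 2.1 (assuming the stock defaults, where the suffix patterns are exposed as nlp.Defaults.suffixes and LIST_ICONS comes from spacy.lang.char_classes):

import spacy
from spacy.util import compile_suffix_regex
from spacy.lang.char_classes import LIST_ICONS

nlp = spacy.load("en_core_web_lg")

# Rebuild the suffix regex without the icon patterns, so ☃ is no longer split off as a suffix
suffixes = [s for s in nlp.Defaults.suffixes if s not in LIST_ICONS]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search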

Better solution: improve the token_match re so it's not so slow.

honnibal closed this as completed on Oct 3, 2019
adrianeboyd (Contributor) commented

The lookbehind in the URL_PATTERN is unacceptably slow here. Each token_match check takes almost 15 seconds on my laptop. I think the prefix and suffix detection can be handled better in the tokenizer.
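
With roughly 13,200 trailing snowmen to strip one by one, ~15 seconds per check on the full string puts the total on the order of 50 hours of token_match calls alone, which is consistent with the 35+ hour hang reported above. A rough way to time a single check (a sketch assuming the spaCy 2.1 layout, where URL_PATTERN is exposed in spacy.lang.tokenizer_exceptions; the shortened stand-in URL below is made up, so timings will differ from the original report):

import re
import timeit
from spacy.lang.tokenizer_exceptions import URL_PATTERN

# A URL followed by a long run of snowmen, standing in for the reported text
sample = "https://si0.twimg.com/profile_images/2711056064/sample.jpeg?x=1" + "☃" * 13200
token_match = re.compile(URL_PATTERN).match
print(timeit.timeit(lambda: token_match(sample), number=1))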

lock bot commented Nov 4, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this issue as resolved and limited the conversation to collaborators on Nov 4, 2019