Spacy tokenizer hangs #4362
Comments
Temporary solution for your use case: customize the tokenizer suffix re to exclude … Better solution: improve the …
The lookbehind in the …
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
I identified a text that causes the tokenizer to [apparently] hang. See the sample script below.
I'm not sure whether the tokenizer will ever terminate; a batch job that was given about 48 hours for processing hung on this text, so I'm fairly certain it blocks the tokenizer for at least 35 hours. The test code above blocks for more than 5 minutes; in contrast, a string of 13200 Snowman emoji '☃' on its own tokenizes in about 22 seconds, and the link without the emoji tokenizes rapidly, as expected.
Note that neither the Snowman emoji nor the link are sufficient on their own to cause the tokenizer to hang; it's something about the combination of the link and the long string of emoji.
On interrupt:
Possibly related to issues #2835 and #2744.
Note that this is a naturally occurring text provided by a user in a social media environment; to decrease the size of the text, I multiplied out the snowman emoji by the number of times it occurred in the original text.
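The maintainer comments above point at the suffix regex and its lookbehind. The original pattern isn't reproduced in this thread, but the symptoms (fast on either input alone, effectively non-terminating on the combination) are consistent with catastrophic regex backtracking. A minimal, self-contained sketch of that failure mode, using illustrative patterns rather than spaCy's actual suffix rules:

```python
import re
import time

# Sketch of the suspected failure mode, NOT spaCy's actual suffix regex:
# nested quantifiers such as (x+)+ backtrack exponentially when the overall
# match fails, while the equivalent single quantifier stays linear.
backtracking = re.compile("(☃+)+x")  # pathological on non-matching input
linear = re.compile("☃+x")           # matches the same strings, linear time

# Mimic the reported text: a long snowman run with no "x" at the end,
# so the pattern cannot match and the engine must exhaust every option.
snowmen = "☃" * 13200 + "!"

t0 = time.perf_counter()
assert linear.match(snowmen) is None          # fails fast, O(n)
print(f"linear pattern, 13200 emoji: {time.perf_counter() - t0:.4f}s")

# backtracking.match(snowmen) would effectively never return (O(2^n));
# even 20 emoji already cost on the order of a million backtracking steps.
t0 = time.perf_counter()
assert backtracking.match("☃" * 20 + "!") is None
print(f"backtracking pattern, 20 emoji: {time.perf_counter() - t0:.4f}s")
```

Rewriting the nested quantifier as a single quantifier (or excluding the problematic characters from the suffix re, as suggested above) removes the exponential blow-up without changing which strings match.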
Your Environment