Spacy tokenizer hangs #2835

Closed
darindf opened this issue Oct 9, 2018 · 4 comments
Labels: feat / tokenizer (Feature: Tokenizer), perf / speed (Performance: speed)

Comments

darindf (Contributor) commented Oct 9, 2018

The following lines of code hang. I tried debugging it; the call goes into tokenizer.pyx, which seems to hang, whereas other sentences/docs go through the tests for lower, prefix, suffix and so forth in en\lex_attrs.py:

import spacy
nlp = spacy.load('en')
nlp('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')

The result is a hung process; note that the process is still using CPU.
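
To make the hang easier to confirm programmatically, here is a minimal watchdog sketch of my own (not from the original report): the call runs in a daemon thread, and the main thread reports if it has not finished within an arbitrary 10-second limit. The function name and the timeout are my own choices.

import threading

import spacy

nlp = spacy.load('en')
text = 'oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:'

done = threading.Event()

def tokenize():
    nlp(text)  # hangs inside the tokenizer on affected setups
    done.set()

threading.Thread(target=tokenize, daemon=True).start()
if done.wait(timeout=10):
    print('tokenizer finished normally')
else:
    print('tokenizer still running after 10 seconds -- likely hung')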

ines added the feat / tokenizer and perf / speed labels Oct 9, 2018
darindf (Contributor, Author) commented Oct 10, 2018

I have identified the root cause of the failure: the token_match call is hanging. The underlying bit can be seen here.

Running it in the Python interpreter and pressing Ctrl-C yields this stack trace:

>>> nlp('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\ProgramData\Anaconda3\lib\site-packages\spacy\language.py", line 346, in __call__
    doc = self.make_doc(text)
  File "c:\ProgramData\Anaconda3\lib\site-packages\spacy\language.py", line 378, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 116, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 155, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 172, in spacy.tokenizer.Tokenizer._split_affixes

where "tokenizer.pyx", line 172 is

            if self.token_match and self.token_match(string):

Testing token_match directly shows that it is the cause:

from spacy.lang.tokenizer_exceptions import TOKEN_MATCH

TOKEN_MATCH('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')

or equivalently

nlp.tokenizer.token_match('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')
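
Since a hung thread cannot be killed, a process-based probe is handier for checking the regex in isolation. This is a rough sketch of mine (not from the thread); the function name, PROBLEM_TEXT, and the 5-second limit are my own assumptions.

import multiprocessing

PROBLEM_TEXT = 'oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:'

def check(text):
    # import inside the child so the compiled TOKEN_MATCH lives in that process
    from spacy.lang.tokenizer_exceptions import TOKEN_MATCH
    TOKEN_MATCH(text)

if __name__ == '__main__':
    proc = multiprocessing.Process(target=check, args=(PROBLEM_TEXT,))
    proc.start()
    proc.join(timeout=5)
    if proc.is_alive():
        proc.terminate()
        print('TOKEN_MATCH exceeded 5 seconds -- consistent with catastrophic backtracking')
    else:
        print('TOKEN_MATCH returned within the limit')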

It appears that I have a newer version of regex installed. The call hangs with regex versions 2018.7.11, 2018.06.21, and 2018.8.29.

After rolling regex back to version 2017.4.5, the issue doesn't appear.

An alternative workaround is to disable token_match:

import spacy
nlp = spacy.load('en')
nlp.tokenizer.token_match = None
nlp('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')
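
If URL detection still matters, a softer workaround than disabling token_match outright would be to swap in a deliberately simple pattern that cannot backtrack catastrophically. This is only a sketch, assuming that assigning a callable works the same way as assigning None above; the pattern and the name simple_url_match are my own simplifications and match far fewer URL forms than spaCy's TOKEN_MATCH.

import re

import spacy

# deliberately crude URL matcher: a short alternation plus \S+, no nested quantifiers
simple_url_match = re.compile(r'(?:https?://|www\.)\S+$').match

nlp = spacy.load('en')
nlp.tokenizer.token_match = simple_url_match

doc = nlp('See https://example.com/docs for details.')
print([t.text for t in doc])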

darindf (Contributor, Author) commented Oct 10, 2018

Reviewing this some more, this regex seems to be adapted from https://mathiasbynens.be/demo/url-regex, or more specifically from this version: https://gist.github.com/dperini/729294,

which was recently updated. To quote:

UPDATED TO FIX CRASHES / SLOWDOWN
After listening and collecting suggestions I updated the gist:

- no more crashes and slowdown in browsers when testing long, random typed, and edge cases domain names
- completely replaced the host/domain name part of the regular expression, shorter and yielding higher performances
- added the underscore ('_') as a valid character, added check for length of each dot sub-parts (< 64 chars), smaller tweaks
Fix your forks and enjoy the new improvements and higher speed.

There is a Python port of this at https://gist.github.com/pchc2005/b5f13e136a9c9bb2984e5b92802fc7c9.

However, for spaCy this version may not be a drop-in replacement, as tokenizer_exceptions.py has the comment:

A few minor mods to this regex to account for use cases represented in test_urls
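
For anyone unfamiliar with why such a regex can lock up, here is a toy illustration of my own (a deliberately bad pattern, not spaCy's actual URL regex) of catastrophic backtracking: nested quantifiers make the match time grow exponentially, roughly quadrupling every time the input grows by two characters.

import re
import time

bad = re.compile(r'(a+)+$')  # nested quantifiers: classic catastrophic backtracking

for n in (18, 20, 22, 24):
    s = 'a' * n + '!'  # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    bad.match(s)
    print('n=%d: %.3f seconds' % (n, time.perf_counter() - start))

# The long dotted string in this issue hits the same failure mode in the old
# host/domain part of the URL regex, just at a much larger scale.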

ines (Member) commented Dec 20, 2018

Just tested it with the latest spacy-nightly, and it seems this issue is now fixed. It was likely related to the same underlying problem as #2744: how the TOKEN_MATCH regex was compiled.

ines closed this as completed Dec 20, 2018
lock bot commented Jan 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019