Spacy tokenizer hangs #2835
I have identified the root cause of the failure: the token_match call is hanging. The underlying bit can be seen here. Running this in the Python interpreter and pressing Ctrl-C yields the stack trace
where "tokenizer.pyx", line 172 is
Testing token_match in isolation, or equivalently matching the underlying URL pattern directly, shows that this is the cause; a sketch of such a test is below.
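A minimal sketch of such a direct test, assuming spaCy 2.0.x (where TOKEN_MATCH and URL_PATTERN live in spacy.lang.tokenizer_exceptions) and using a stand-in input rather than the original failing text:

```python
# Sketch: exercise the tokenizer's URL pattern directly, outside nlp().
# Assumes the spaCy 2.0.x layout, where TOKEN_MATCH is the compiled
# URL pattern's .match method.
import regex
from spacy.lang.tokenizer_exceptions import TOKEN_MATCH, URL_PATTERN

text = "http://example.com/some/path"  # stand-in; original input not shown

# Both calls exercise the same pattern; on regex 2018.x this can fail to
# return for pathological inputs, while regex 2017.4.5 returns promptly.
print(TOKEN_MATCH(text))
print(regex.compile(URL_PATTERN).match(text))
```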
It appears that I have a newer version of regex installed. The failure occurs with regex versions 2018.07.11, 2018.06.21, and 2018.08.29; rolling regex back to version 2017.4.5 makes the issue disappear. An alternative workaround is to disable token_match.
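A minimal sketch of the disable-token_match workaround (the other option is pinning the dependency, e.g. `pip install regex==2017.4.5`); the tokenizer's token_match attribute is settable at runtime:

```python
# Workaround sketch: clear token_match so the URL pattern is never consulted.
# URLs lose their special single-token handling, but tokenization terminates.
import spacy

nlp = spacy.blank("en")           # a blank English pipeline shows the tokenizer
nlp.tokenizer.token_match = None  # fall back to prefix/suffix/infix rules only

doc = nlp("Check https://example.com for details")
print([t.text for t in doc])      # completes; URLs handled by ordinary rules
```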
Reviewing this some more, the pattern seems to be adapted from a source that has been recently updated. To quote:
And there is a Python port of this at https://gist.github.com/pchc2005/b5f13e136a9c9bb2984e5b92802fc7c9. However, for spaCy this version may not be a drop-in replacement, as tokenizer_exceptions.py has the comment:
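If the ported pattern were tried despite that caveat, swapping it in at runtime is mechanical; a sketch below uses a deliberately simplified placeholder pattern (NEW_URL_PATTERN is hypothetical, not the gist's actual regex):

```python
# Sketch: replace the default token_match with a custom compiled pattern.
# NEW_URL_PATTERN is a toy placeholder; substitute the ported gist pattern
# and re-run spaCy's tokenizer tests to check drop-in compatibility.
import regex
import spacy

NEW_URL_PATTERN = r"https?://\S+"  # placeholder, not the real pattern

nlp = spacy.blank("en")
nlp.tokenizer.token_match = regex.compile(NEW_URL_PATTERN).match

doc = nlp("See https://example.com/path?q=1 for details")
print([t.text for t in doc])  # the URL stays a single token
```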
Just tested it in the latest
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
The following lines of code hang. I tried debugging it, but the code goes through tokenizer.pyx, which seems to hang; other sentences/docs go through the tests for lower, prefix, suffix, and so forth in en\lex_attrs.py.
The result of this is a hung process; note that the process is still using CPU.
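A sketch of the shape of the reproduction described, with a stand-in input in place of the original text:

```python
# Hypothetical reproduction shape: on regex 2018.x the call below never
# returns, spinning at full CPU inside tokenizer.pyx's token_match check.
import spacy

nlp = spacy.blank("en")
text = "stand-in for the original failing text"  # placeholder input
doc = nlp(text)  # hangs in token_match on affected regex versions
print([t.text for t in doc])
```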