Spacy tokenizer hangs #2835

Closed
darindf opened this issue Oct 9, 2018 · 4 comments
Labels: feat / tokenizer (Feature: Tokenizer), perf / speed (Performance: speed)

Comments

darindf (Contributor) commented Oct 9, 2018

The following lines of code hang. I tried debugging it; the call goes into tokenizer.pyx, which seems to hang, whereas other sentences/docs go through the tests for lower, prefix, suffix and so forth in en\lex_attrs.py:

import spacy
nlp = spacy.load('en')
nlp('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')

The result is a hung process; note that the process is still using CPU.
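
To make the hang easier to confirm programmatically, here is a minimal watchdog sketch of my own (not from the original report): the call runs in a daemon thread, and the main thread reports if it has not finished within an arbitrary 10-second limit. The function name and the timeout are my own choices.

import threading

import spacy

nlp = spacy.load('en')
text = 'oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:'

done = threading.Event()

def tokenize():
    nlp(text)  # hangs inside the tokenizer on affected setups
    done.set()

threading.Thread(target=tokenize, daemon=True).start()
if done.wait(timeout=10):
    print('tokenizer finished normally')
else:
    print('tokenizer still running after 10 seconds -- likely hung')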

ines added the feat / tokenizer and perf / speed labels Oct 9, 2018
darindf (Contributor, Author) commented Oct 10, 2018

I have identified the root cause of the failure: the token_match call is hanging. The underlying bit can be seen here.

Running it in the Python interpreter and pressing Ctrl-C yields this stack trace:

>>> nlp('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\ProgramData\Anaconda3\lib\site-packages\spacy\language.py", line 346, in __call__
    doc = self.make_doc(text)
  File "c:\ProgramData\Anaconda3\lib\site-packages\spacy\language.py", line 378, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 116, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 155, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 172, in spacy.tokenizer.Tokenizer._split_affixes

where "tokenizer.pyx", line 172 is

            if self.token_match and self.token_match(string):

Testing token_match directly shows that it is the cause:

from spacy.lang.tokenizer_exceptions import TOKEN_MATCH

TOKEN_MATCH('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')

or equivalently

nlp.tokenizer.token_match('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')
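
Since a hung thread cannot be killed, a process-based probe is handier for checking the regex in isolation. This is a rough sketch of mine (not from the thread); the function name, PROBLEM_TEXT, and the 5-second limit are my own assumptions.

import multiprocessing

PROBLEM_TEXT = 'oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:'

def check(text):
    # import inside the child so the compiled TOKEN_MATCH lives in that process
    from spacy.lang.tokenizer_exceptions import TOKEN_MATCH
    TOKEN_MATCH(text)

if __name__ == '__main__':
    proc = multiprocessing.Process(target=check, args=(PROBLEM_TEXT,))
    proc.start()
    proc.join(timeout=5)
    if proc.is_alive():
        proc.terminate()
        print('TOKEN_MATCH exceeded 5 seconds -- consistent with catastrophic backtracking')
    else:
        print('TOKEN_MATCH returned within the limit')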

It appears that I have a newer version of regex installed. The call hangs with regex versions 2018.7.11, 2018.06.21, and 2018.8.29.

After rolling regex back to version 2017.4.5, the issue doesn't appear.

An alternative workaround is to disable token_match:

import spacy
nlp = spacy.load('en')
nlp.tokenizer.token_match = None
nlp('oow.jspsearch.eventoracleopenworldsearch.technologyoraclesolarissearch.technologystoragesearch.technologylinuxsearch.technologyserverssearch.technologyvirtualizationsearch.technologyengineeredsystemspcodewwmkmppscem:')
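
If URL detection still matters, a softer workaround than disabling token_match outright would be to swap in a deliberately simple pattern that cannot backtrack catastrophically. This is only a sketch, assuming that assigning a callable works the same way as assigning None above; the pattern and the name simple_url_match are my own simplifications and match far fewer URL forms than spaCy's TOKEN_MATCH.

import re

import spacy

# deliberately crude URL matcher: a short alternation plus \S+, no nested quantifiers
simple_url_match = re.compile(r'(?:https?://|www\.)\S+$').match

nlp = spacy.load('en')
nlp.tokenizer.token_match = simple_url_match

doc = nlp('See https://example.com/docs for details.')
print([t.text for t in doc])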

darindf (Contributor, Author) commented Oct 10, 2018

Reviewing this some more, this regex seems to be adapted from https://mathiasbynens.be/demo/url-regex, or more specifically from this version: https://gist.github.com/dperini/729294,

which was recently updated. To quote:

UPDATED TO FIX CRASHES / SLOWDOWN
After listening and collecting suggestions I updated the gist:

- no more crashes and slowdown in browsers when testing long, random typed, and edge cases domain names
- completely replaced the host/domain name part of the regular expression, shorter and yielding higher performances
- added the underscore ('_') as a valid character, added check for length of each dot sub-parts (< 64 chars), smaller tweaks
Fix your forks and enjoy the new improvements and higher speed.

There is a Python port of this at https://gist.github.com/pchc2005/b5f13e136a9c9bb2984e5b92802fc7c9.

However, for spaCy this version may not be a drop-in replacement, as tokenizer_exceptions.py has the comment:

A few minor mods to this regex to account for use cases represented in test_urls
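
For anyone unfamiliar with why such a regex can lock up, here is a toy illustration of my own (a deliberately bad pattern, not spaCy's actual URL regex) of catastrophic backtracking: nested quantifiers make the match time grow exponentially, roughly quadrupling every time the input grows by two characters.

import re
import time

bad = re.compile(r'(a+)+$')  # nested quantifiers: classic catastrophic backtracking

for n in (18, 20, 22, 24):
    s = 'a' * n + '!'  # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    bad.match(s)
    print('n=%d: %.3f seconds' % (n, time.perf_counter() - start))

# The long dotted string in this issue hits the same failure mode in the old
# host/domain part of the URL regex, just at a much larger scale.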

ines (Member) commented Dec 20, 2018

Just tested it with the latest spacy-nightly, and it seems this issue is now fixed. It was likely related to the same underlying problem as #2744: how the TOKEN_MATCH regex was compiled.

ines closed this as completed Dec 20, 2018
lock bot commented Jan 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019