Tokenizer.pipe implementation does not match the docstring #1358

I've recently found the Tokenizer.pipe() method, which is supposed to tokenize faster using several threads. However, I never observed more than 100% CPU usage. Then I looked into the code and found https://github.com/explosion/spaCy/blob/master/spacy/tokenizer.pyx#L169, which ignores n_threads and batch_size and is absolutely equivalent to regular iteration. I appreciate spaCy's sense of humour, but it would be nice to update the docstring.
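For reference, the method body boils down to plain sequential iteration over the input. Roughly (a paraphrase for illustration, not the literal Cython source):

```python
# Rough paraphrase of what Tokenizer.pipe currently does:
# batch_size and n_threads are accepted but never used,
# so this is just a sequential loop.
def pipe(self, texts, batch_size=1000, n_threads=2):
    for text in texts:
        yield self(text)  # equivalent to tokenizing one text at a time
```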
Comments
The current docstring doesn't seem too bad here? It tells you that the implementation is single-threaded:

```python
"""
Tokenize a stream of texts.

Arguments:
    texts: A sequence of unicode texts.

    batch_size (int):
        The number of texts to accumulate in an internal buffer.

    n_threads (int):
        The number of threads to use, if the implementation supports
        multi-threading. The default tokenizer is single-threaded.

Yields:
    Doc A sequence of Doc objects, in order.
"""
```
Btw, my proposed solution would be here: #1303
A-ha! Last night this read to me as "the default n_threads is 1", and this morning it actually makes sense, having read it twice. Is there a way to have a multithreaded implementation, and why is it not used by default? If this is for the future, why not
Well... Multi-threading the tokenizer is hard because we have to interact with Python-land to allow customisation. Multi-processing seems much better here. The task is embarrassingly parallel and there's not much data to fan out to the workers.
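For illustration, a multiprocessing approach could look roughly like this from user land (a sketch only, not spaCy's API; the model name is an assumption):

```python
# Illustrative multiprocessing sketch, not part of spaCy itself.
# Each worker process loads its own pipeline once, then tokenizes
# texts; only plain strings cross the process boundary.
from multiprocessing import Pool

import spacy

_nlp = None

def _init_worker():
    # Load the pipeline once per worker to amortise startup cost.
    global _nlp
    _nlp = spacy.load("en_core_web_sm")  # model name is an assumption

def _tokenize(text):
    # Return token strings so the result pickles cheaply.
    return [t.text for t in _nlp.tokenizer(text)]

if __name__ == "__main__":
    texts = ["First text.", "Second text."] * 5000
    with Pool(processes=4, initializer=_init_worker) as pool:
        # chunksize amortises IPC overhead over batches of texts
        for tokens in pool.imap(_tokenize, texts, chunksize=256):
            pass  # consume results, in order
```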
OK, then why do we need these arguments at all? You say multithreading is very hard (read: impossible), so?...
The tokenizer should have the same API as the other pipeline components (the parser, tagger, etc.). I think it's reasonable to change the name of the parameter from
All right. Last time we had a problem and decided to use multiprocessing, we got two problems, so I really wish you good luck and steel nerves :)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |