Tokenizer.pipe implementation does not match the docstring #1358

Closed

vmarkovtsev opened this issue Sep 23, 2017 · 8 comments
Labels: docs (Documentation and website)

@vmarkovtsev commented Sep 23, 2017

I recently found the Tokenizer.pipe() method, which is supposed to tokenize faster by using several threads.

However, I never observed more than 100% CPU usage. So I looked into the code and found https://github.com/explosion/spaCy/blob/master/spacy/tokenizer.pyx#L169, which ignores n_threads and batch_size and is exactly equivalent to the regular one-text-at-a-time iteration. I appreciate spaCy's sense of humour, but it would be nice to update the docstring.
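
For reference, the implementation at that line is effectively the following (a paraphrase for illustration, not the literal source):

    # Paraphrase of Tokenizer.pipe at the linked revision: batch_size
    # and n_threads are accepted for API compatibility but never used,
    # so this is equivalent to calling the tokenizer on each text in turn.
    def pipe(self, texts, batch_size=1000, n_threads=2):
        for text in texts:
            yield self(text)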

@honnibal (Member) commented Sep 23, 2017

The current docstring doesn't seem too bad here? It tells you that the implementation is single-threaded.

        """
        Tokenize a stream of texts.
        Arguments:
            texts: A sequence of unicode texts.
            batch_size (int):
                The number of texts to accumulate in an internal buffer.
            n_threads (int):
                The number of threads to use, if the implementation supports
                multi-threading. The default tokenizer is single-threaded.
        Yields:
            Doc: A sequence of Doc objects, in order.
        """

honnibal added the docs (Documentation and website) label Sep 24, 2017
@honnibal (Member) commented

Btw, my proposed solution would be here: #1303

@vmarkovtsev (Author) commented

A-ha! Last night I read this as "the default n_threads is 1", and this morning, after reading it twice, it actually makes sense. Is there a way to have a multithreaded implementation, and why isn't one used by default? If this is reserved for the future, why not **kwargs instead? E.g., my crazy hardcore tokenizer may run on a GPU, a Phi, or an FPGA and have all sorts of varying parameters.

@honnibal (Member) commented

Well... **kwargs is sort of like answering "What's your API?" with "not telling ;)". It has to be used sometimes and it's creeping in, but I do like to avoid it when I can. If your tokenizer has crazy parameters, it should read them from the config on initialization.
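
A sketch of what that might look like, with invented names (MyTokenizer, device, and chunk_size are hypothetical, not spaCy API):

    # Hypothetical custom tokenizer: hardware-specific parameters are
    # read on initialization, so pipe() keeps the shared component API.
    class MyTokenizer(object):
        def __init__(self, vocab, device="cpu", chunk_size=4096):
            self.vocab = vocab
            self.device = device          # e.g. "cpu", "gpu", "fpga"
            self.chunk_size = chunk_size

        def __call__(self, text):
            raise NotImplementedError     # tokenize a single text into a Doc

        def pipe(self, texts, batch_size=1000, n_threads=2):
            # Same signature as the built-in components; the exotic
            # knobs above were already fixed at construction time.
            for text in texts:
                yield self(text)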

Multi-threading the tokenizer is hard because we have to interact with Python-land to allow customisation, which means holding the GIL. Multi-processing seems much better here: the task is embarrassingly parallel and there's not much data to fan out to the workers.
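
A minimal sketch of that multiprocessing idea (an illustration, not spaCy's implementation; the tokenize() helper is a stand-in that returns plain strings, since full Doc objects are expensive to pickle back from worker processes):

    # Fan texts out to worker processes and tokenize independently.
    from multiprocessing import Pool

    def tokenize(text):
        # Stand-in for a real tokenizer call.
        return text.split()

    if __name__ == "__main__":
        texts = ["First document.", "Second document."] * 1000
        with Pool(processes=4) as pool:
            # imap preserves input order; chunksize cuts IPC overhead.
            for tokens in pool.imap(tokenize, texts, chunksize=100):
                pass  # consume the tokenized results in order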

@vmarkovtsev (Author) commented

OK, then why do we need these arguments at all? You say multithreading is very hard (read: impossible), so why keep them?

@honnibal (Member) commented

The tokenizer should have the same API as the other pipeline components (the parser, tagger, etc.). I think it's reasonable to change the name of the parameter from n_threads to something that covers both threads and processes; I was thinking of something like workers. This would open up the range of values a little bit.
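
A sketch of the proposed shape (hypothetical; workers is only an idea floated in this thread, not an actual spaCy argument):

    # Hypothetical signature with the renamed parameter.
    def pipe(self, texts, batch_size=1000, workers=1):
        # workers=1: process texts inline, as today.
        # workers>1: fan batches out to a thread or process pool.
        # A value like workers=-1 could mean "use all available cores".
        ...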

@vmarkovtsev (Author) commented

All right.

The last time we had a problem and decided to use multiprocessing, we ended up with two problems, so I really wish you good luck and nerves of steel :)

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018