Tokenizer.pipe implementation does not match the docstring #1358

Closed

vmarkovtsev opened this issue Sep 23, 2017 · 8 comments
Labels: docs (Documentation and website)

@vmarkovtsev commented Sep 23, 2017

I recently found the Tokenizer.pipe() method, which is supposed to tokenize faster by using several threads.

However, I never observed more than 100% CPU usage. So I looked into the code and found https://github.com/explosion/spaCy/blob/master/spacy/tokenizer.pyx#L169, which ignores n_threads and batch_size and is exactly equivalent to the regular one-text-at-a-time iteration. I appreciate spaCy's sense of humour, but it would be nice to update the docstring.
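
For reference, the implementation at that line is effectively the following (a paraphrase for illustration, not the literal source):

    # Paraphrase of Tokenizer.pipe at the linked revision: batch_size
    # and n_threads are accepted for API compatibility but never used,
    # so this is equivalent to calling the tokenizer on each text in turn.
    def pipe(self, texts, batch_size=1000, n_threads=2):
        for text in texts:
            yield self(text)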

@honnibal (Member) commented Sep 23, 2017

The current docstring doesn't seem too bad here? It tells you that the implementation is single-threaded.

        """
        Tokenize a stream of texts.
        Arguments:
            texts: A sequence of unicode texts.
            batch_size (int):
                The number of texts to accumulate in an internal buffer.
            n_threads (int):
                The number of threads to use, if the implementation supports
                multi-threading. The default tokenizer is single-threaded.
        Yields:
            Doc: A sequence of Doc objects, in order.
        """

honnibal added the docs (Documentation and website) label Sep 24, 2017
@honnibal (Member) commented

Btw, my proposed solution would be here: #1303

@vmarkovtsev (Author) commented

A-ha! Last night I read this as "the default n_threads is 1", and this morning, after reading it twice, it actually makes sense. Is there a way to have a multithreaded implementation, and why isn't one used by default? If this is reserved for the future, why not **kwargs instead? E.g., my crazy hardcore tokenizer may run on a GPU, a Phi, or an FPGA and have all sorts of varying parameters.

@honnibal (Member) commented

Well... **kwargs is sort of like answering "What's your API?" with "not telling ;)". It has to be used sometimes and it's creeping in, but I do like to avoid it when I can. If your tokenizer has crazy parameters, it should read them from the config on initialization.
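
A sketch of what that might look like, with invented names (MyTokenizer, device, and chunk_size are hypothetical, not spaCy API):

    # Hypothetical custom tokenizer: hardware-specific parameters are
    # read on initialization, so pipe() keeps the shared component API.
    class MyTokenizer(object):
        def __init__(self, vocab, device="cpu", chunk_size=4096):
            self.vocab = vocab
            self.device = device          # e.g. "cpu", "gpu", "fpga"
            self.chunk_size = chunk_size

        def __call__(self, text):
            raise NotImplementedError     # tokenize a single text into a Doc

        def pipe(self, texts, batch_size=1000, n_threads=2):
            # Same signature as the built-in components; the exotic
            # knobs above were already fixed at construction time.
            for text in texts:
                yield self(text)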

Multi-threading the tokenizer is hard because we have to interact with Python-land to allow customisation, which means holding the GIL. Multi-processing seems much better here: the task is embarrassingly parallel and there's not much data to fan out to the workers.
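
A minimal sketch of that multiprocessing idea (an illustration, not spaCy's implementation; the tokenize() helper is a stand-in that returns plain strings, since full Doc objects are expensive to pickle back from worker processes):

    # Fan texts out to worker processes and tokenize independently.
    from multiprocessing import Pool

    def tokenize(text):
        # Stand-in for a real tokenizer call.
        return text.split()

    if __name__ == "__main__":
        texts = ["First document.", "Second document."] * 1000
        with Pool(processes=4) as pool:
            # imap preserves input order; chunksize cuts IPC overhead.
            for tokens in pool.imap(tokenize, texts, chunksize=100):
                pass  # consume the tokenized results in order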

@vmarkovtsev (Author) commented

OK, then why do we need these arguments at all? You say multithreading is very hard (read: impossible), so why keep them?

@honnibal (Member) commented

The tokenizer should have the same API as the other pipeline components (the parser, tagger, etc.). I think it's reasonable to change the name of the parameter from n_threads to something that covers both threads and processes; I was thinking of something like workers. This would open up the range of values a little bit.
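
A sketch of the proposed shape (hypothetical; workers is only an idea floated in this thread, not an actual spaCy argument):

    # Hypothetical signature with the renamed parameter.
    def pipe(self, texts, batch_size=1000, workers=1):
        # workers=1: process texts inline, as today.
        # workers>1: fan batches out to a thread or process pool.
        # A value like workers=-1 could mean "use all available cores".
        ...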

@vmarkovtsev (Author) commented

All right.

The last time we had a problem and decided to use multiprocessing, we ended up with two problems, so I really wish you good luck and nerves of steel :)

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018