Can I multithread the tokenizer? #1321

phdowling · 2017-09-12T01:55:02Z

I'm trying to process a lot of text (about 1 billion sentences); for now I only need tokenization. The sentence boundaries are already detected, so I really just want to split the text into tokens and write them to a file, space-separated. The default tokenizer in spaCy is single-threaded. Can I somehow call this exact tokenizer in a multithreaded (i.e. multiprocess) way?

phdowling · 2017-09-12T02:08:51Z

My workaround for this use case, for now, is to use multiprocessing's pool.imap on the TreebankWordTokenizer from nltk. I'd much rather be able to use spaCy, though.
Maybe support for this could be added? I know it would cause issues with the GIL, but the reason it's needed is to support some custom user callbacks, right? Maybe an option could be added to accept that incompatibility in return for better speed.

honnibal · 2017-09-15T08:29:00Z

You can multi-process it just the same as you do with nltk. I usually use joblib, but there's no problem with using pool.imap.

There's an enhancement issue open about using multiprocessing in nlp.pipe. In the meantime, sure -- use nlp.make_doc() with multiprocessing. This works best in spaCy 2, where the load times are very short. Further discussion should go in #1303.

lock · 2018-05-08T17:27:15Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal closed this as completed Sep 15, 2017

honnibal added the usage General spaCy usage label Sep 15, 2017

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
