Can I multithread the tokenizer? #1321

Closed
phdowling opened this issue Sep 12, 2017 · 3 comments
Labels
usage (General spaCy usage)

Comments

@phdowling

I'm trying to process a lot of text (about 1B sentences); for now I only need tokenization. Sentence boundaries are already detected, so I really just want to split the text into tokens and write them to a file, space-separated. The default tokenizer in spaCy is single-threaded. Can I somehow call this exact tokenizer in a multithreaded (i.e. multiprocess) way?

@phdowling
Author

For now, my workaround for this use case is to use multiprocessing's pool.imap with the TreebankWordTokenizer from nltk. I'd much rather be able to use spaCy, though.
Maybe support for this could be added? I know there would be issues with the GIL, but the GIL is only needed to support some custom user callbacks, right? Maybe an option could be added that accepts that incompatibility in return for better speed.

@honnibal
Member

honnibal commented Sep 15, 2017

You can run it across multiple processes exactly as you do with nltk. I usually use joblib, but there's no problem with using pool.imap.

There's an enhancement issue open about using multiprocessing in nlp.pipe. In the meantime, sure: use nlp.make_doc(), which runs only the tokenizer, with multiprocessing. This works best in spaCy 2, where load times are very short. Further discussion should go in #1303.
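
For readers landing here, a minimal sketch of that suggestion (not code from this thread). It assumes spaCy 2 or later, English text, and one sentence per input line. The helper names (`_init_worker`, `tokenize_line`, `tokenize_file`) and the default `processes` and `chunksize` values are illustrative assumptions, not part of spaCy.

```python
import multiprocessing as mp

# One pipeline object per worker process; filled in by _init_worker.
_nlp = None


def _init_worker():
    """Create the tokenizer once in each worker instead of once per sentence."""
    global _nlp
    import spacy  # imported here so every worker process loads it itself

    # Assumption: English text. spacy.blank gives a pipeline with only
    # the tokenizer, so no statistical model has to be loaded.
    _nlp = spacy.blank("en")


def tokenize_line(line):
    """Tokenize one sentence and join the tokens with single spaces."""
    doc = _nlp.make_doc(line)  # make_doc runs the tokenizer only
    return " ".join(tok.text for tok in doc if not tok.is_space)


def tokenize_file(in_path, out_path, processes=4, chunksize=1000):
    """Read one sentence per line and write space-separated tokens per line."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        with mp.Pool(processes, initializer=_init_worker) as pool:
            lines = (line.rstrip("\n") for line in src)
            # imap yields results in input order; chunksize batches the
            # work so that inter-process overhead does not dominate.
            for tokens in pool.imap(tokenize_line, lines, chunksize=chunksize):
                dst.write(tokens + "\n")


# Hypothetical usage; keep the call under this guard in a script:
# if __name__ == "__main__":
#     tokenize_file("sentences.txt", "tokens.txt")
```

Building the pipeline in the pool initializer means each worker pays the load cost once. A fairly large `chunksize` matters for very big inputs, since sending sentences to workers one at a time would likely cost more than the tokenization itself.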

@honnibal added the usage (General spaCy usage) label on Sep 15, 2017
@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018