Can I multithread the tokenizer? #1321
I'm trying to process a large amount of text (about 1B sentences), and for now I only need tokenization. The sentence boundaries are already detected, so I really just want to split the text into tokens and write them to a file, space-separated. The default tokenizer in spaCy is single-threaded. Can I somehow call this exact tokenizer in a multithreaded (i.e., multiprocess) way?
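For context, the single-threaded version of the job described above looks roughly like the following sketch; the file paths and the blank-pipeline setup are assumptions, not from the thread:

```python
import spacy

# A blank English pipeline carries only the tokenizer, so each
# sentence pays for tokenization and nothing else.
nlp = spacy.blank("en")
tokenizer = nlp.tokenizer

# Hypothetical layout: one pre-split sentence per line in,
# space-separated tokens per line out.
with open("sentences.txt", encoding="utf8") as fin, \
        open("tokens.txt", "w", encoding="utf8") as fout:
    for line in fin:
        doc = tokenizer(line.rstrip("\n"))
        fout.write(" ".join(token.text for token in doc) + "\n")
```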
Comments

My workaround for now, for this use case, is to use multiprocessing's `Pool`, as sketched below.
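A minimal sketch of that kind of workaround, assuming `multiprocessing.Pool` with one tokenizer per worker process; the file names, chunk size, and worker count here are illustrative:

```python
import multiprocessing as mp

import spacy


def init_worker():
    # Each worker builds its own tokenizer; spaCy objects generally
    # should not be shared across process boundaries.
    global TOKENIZER
    TOKENIZER = spacy.blank("en").tokenizer


def tokenize_line(line):
    doc = TOKENIZER(line.rstrip("\n"))
    return " ".join(token.text for token in doc)


if __name__ == "__main__":
    with open("sentences.txt", encoding="utf8") as fin, \
            open("tokens.txt", "w", encoding="utf8") as fout, \
            mp.Pool(processes=4, initializer=init_worker) as pool:
        # imap preserves input order and streams results, so the
        # whole corpus never has to fit in memory at once.
        for tokens in pool.imap(tokenize_line, fin, chunksize=1000):
            fout.write(tokens + "\n")
```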
You can multi-process it just the same as you do with the full pipeline. There's also an enhancement issue open about this.
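In more recent spaCy releases, `Language.pipe` accepts an `n_process` argument, so the same idea can be expressed without hand-rolled pooling; a sketch under that assumption, with illustrative paths and counts:

```python
import spacy

nlp = spacy.blank("en")

with open("sentences.txt", encoding="utf8") as fin, \
        open("tokens.txt", "w", encoding="utf8") as fout:
    lines = (line.rstrip("\n") for line in fin)
    # Language.pipe streams texts in batches; with a blank pipeline it
    # effectively just runs the tokenizer, and n_process fans the work
    # out across processes in recent spaCy versions.
    for doc in nlp.pipe(lines, batch_size=1000, n_process=4):
        fout.write(" ".join(token.text for token in doc) + "\n")
```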