multithreading for Corpus creation #277
Here I found an example of how that should work with spaCy; will explore and report back: https://github.com/explosion/spaCy/blob/master/examples/pipeline/multi_processing.py
So here is an example of running the corpus creation in multiple processes vs. a single process. Unfortunately, the multiprocessing version is even slower: 04:23 minutes for Corpus(12000 docs, 1776000 tokens) vs. 03:29 for a single process (on a 2-core/4-thread laptop). Is there something that could be done to actually profit from multiprocessing with textacy?
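The processes-vs-single-process comparison can be sketched without spaCy at all; `fake_parse` below is a stand-in for the per-document parse, not the real pipeline. The sketch illustrates why a pool can lose: when per-document work is cheap, the cost of pickling inputs and results between processes dominates.

```python
import time
from multiprocessing import Pool

def fake_parse(text):
    # Stand-in for nlp(text): cheap per-document work, so pickling
    # inputs/outputs between processes can easily outweigh the savings.
    return text.split()

def run_serial(texts):
    return [fake_parse(t) for t in texts]

def run_pool(texts, n_jobs=2):
    # chunksize batches the IPC so each round-trip carries many texts
    with Pool(n_jobs) as pool:
        return pool.map(fake_parse, texts, chunksize=100)

if __name__ == "__main__":
    texts = ["some document text %d" % i for i in range(10_000)]
    t0 = time.perf_counter()
    serial = run_serial(texts)
    t1 = time.perf_counter()
    pooled = run_pool(texts)
    t2 = time.perf_counter()
    assert serial == pooled
    print(f"serial: {t1 - t0:.3f}s, pool: {t2 - t1:.3f}s")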
Hi @Motorrat, thanks for the detailed posts, and apologies for the belated reply. spaCy recently (re-)implemented multiprocessing in its core; I'm going to do some work to make use of it.
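The core feature referred to here is the `n_process` argument added to `Language.pipe` in spaCy v2.2.2. A minimal sketch (using a blank pipeline so no model download is needed):

```python
import spacy

# A blank English pipeline keeps this sketch model-free; with a loaded
# model the call is the same.
nlp = spacy.blank("en")
texts = ["first document", "second document", "third document"]

# n_process fans the parsing out over worker processes; batch_size
# controls how many texts are sent to each worker at a time.
docs = list(nlp.pipe(texts, n_process=2, batch_size=50))
print(len(docs))  # 3
```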
Heads-up: #285
The update is now available in a release: https://github.com/chartbeat-labs/textacy/releases/tag/0.10.0
context
I am trying to speed up Corpus creation via parallel processing. The source is a dictionary of text strings, so the code looks roughly like this:
Currently it seems there is some bottleneck that prevents this from utilizing all available processing power. With a certain dataset I get roughly 3000 texts added per minute regardless of how many jobs are running: 1, 4, or 8 (threads). I have tried it on a 2-core laptop and on an AWS 2xlarge instance (4 cores/8 threads) with roughly the same throughput on both.
A dataset of 3 million tokens takes roughly 5 minutes to load into a Corpus. I wonder if this is really bound by processing power or if there is some bottleneck that could be removed.
I have tried a workaround and implemented the same with preprocessing the texts into spacy docs:
Unfortunately, there was no speed-up, so I guess I would need to open an issue against spaCy and see if something could be done there.
proposed solution
Add a parallel-processing option to textacy.corpus.Corpus(data, lang) and change its signature to textacy.corpus.Corpus(lang, data, n_jobs=1).
alternative solutions?
Share the above doc snippets as examples in the documentation.