Batch processing doesn't speed up? #4935
I loaded 300,000 lines of Chinese text, each line about the length of a tweet, and compared the speed of batch and non-batch processing over `text_lines`, an array of those lines. The two runs take almost the same amount of time: 2.5 and 2.6 minutes. Why doesn't batching speed things up? spaCy is using Jieba for segmentation. Is it possible that Jieba doesn't benefit from batching?
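The original snippets are not shown above; a minimal reconstruction of the two timings, assuming spaCy 2.x (where the `Chinese` language class uses Jieba by default) and a hypothetical `load_text_lines` helper standing in for however the lines were read:

```python
import time

from spacy.lang.zh import Chinese  # spaCy 2.x; Jieba is the default word segmenter

nlp = Chinese()
text_lines = load_text_lines()  # hypothetical: returns the 300,000 lines as a list of str

# Batch mode: stream all lines through nlp.pipe
start = time.time()
batch_docs = list(nlp.pipe(text_lines, batch_size=1000))
print("batch:", time.time() - start)

# Non-batch mode: call the pipeline once per line
start = time.time()
loop_docs = [nlp(line) for line in text_lines]
print("non-batch:", time.time() - start)
```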
Hi, this design makes sense for languages that use spaCy's built-in rule-based tokenizer, but it may become more of a bottleneck for languages that use slower external tokenizers / word segmenters. We'll have to keep this in mind for the future, but for now I would just suggest trying multiprocessing.
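That suggestion presumably refers to the `n_process` argument of `nlp.pipe` (added around spaCy v2.2.2); a minimal sketch, with the worker and batch counts chosen arbitrarily:

```python
# Multiprocessing with nlp.pipe: each worker process runs the whole
# pipeline (including the slow external segmenter) on its own chunk of
# the input, so wall-clock time can drop even when batching alone doesn't.
docs = list(nlp.pipe(text_lines, n_process=4, batch_size=200))
```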
Hi Adrian, it indeed speeds things up a lot. However, it fails when I add a custom sentence segmenter to the pipeline. If I comment out the `nlp.add_pipe()` call, it is fine; otherwise, it reports an error. What might be causing this?
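The segmenter itself isn't shown in the thread; a hypothetical sketch of this kind of component, using the spaCy 2.x `add_pipe` API that accepts a callable:

```python
def custom_sentence_segmenter(doc):
    # Mark a sentence boundary after each Chinese sentence-final mark.
    for i in range(1, len(doc)):
        doc[i].is_sent_start = doc[i - 1].text in ("。", "！", "？")
    return doc

# spaCy 2.x: pass the function directly; spaCy 3.x registers components by name instead.
nlp.add_pipe(custom_sentence_segmenter, name="custom_segmenter", first=True)
```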
That looks like a bug, I'll take a closer look tomorrow.
Hi Adrian, just a reminder that it doesn't fail immediately; it fails after a certain amount of data or on certain sentences. I tested one file with 70,000 sentences, and it went all right. On another file it reported the error after processing fewer than 3,000 sentences. It might be caused by concurrent processing or something else. When I disabled the custom segmenter, everything went fine.
The bug is related to processing an empty text/doc. See #4935, which should fix it.
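For context, a hedged illustration of why empty input is a natural edge case here: an empty line in the batch produces a `Doc` with zero tokens, which any code that assumes at least one token will trip over.

```python
texts = ["今天天气不错。", "", "明天再说。"]

for doc in nlp.pipe(texts):
    # The empty string yields len(doc) == 0; any component (or library
    # internals) indexing doc[0] unconditionally would fail on it.
    print(len(doc))
```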
It worked, thanks.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.