multiprocessing pipe (#1303) #4371
I've wanted functionality like this for some time, so this is definitely cool. A couple of comments.
|
@honnibal Thanks for your comment!
Thanks, I didn't think about the vectors. https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db
I think so too, though it will be difficult.
What do |
It seems that the pipe test fails in Python 2.7. |
@honnibal I don't want to support Python2.7, what do you think? |
That's a great solution, go right ahead. If 3.5 gives you trouble you can do the same there. |
@tamuhey Sorry for not being clear:
I mean that the program might be slow if you ask it to process a small amount of text with multiple processes. For instance, I think using 2 processes and only 100 documents, the program will be much much slower than using 1 process and 100 documents. The library could try to figure this out, and avoid introducing extra processes if they won't be helpful. This would mean the The advantage of this is that I think a majority of users would benefit from it, as people expect to always improve performance by adding more resources. The disadvantage is that we're trying to be clever, rather than just doing what the user tells us to do. |
@honnibal Thanks, I got it. |
@tamuhey Let's try not to be too clever at first, and just do what we're told. We can always add logic to be "smarter" later. |
@honnibal Ok, I will try it on another PR. |
Nice to see it go green! I want to try it out a bit before merging, but in theory it looks good. |
@honnibal |
I'm coming here following a reference from textacy chartbeat-labs/textacy#277
As you can see, there is not much speed advantage. I wonder why? The system monitor (on Ubuntu 18.04 64-bit, 2 cores/4 threads CPU) shows all four threads busy and 100% CPU with n_process=4, versus one thread and 25% CPU with n_process=1.
|
From #1303.
Implement multiprocessing nlp.pipe. You can easily parallelize it by passing the n_process argument, as follows:
The following link is a notebook in which the execution time was simply measured.
https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db
Description
modification
- Language.pipe: n_process argument
- _apply_pipes function in language.py. This is the worker for multiprocessing.
- test_language.py
implementation overview
Send batches of text (str) to workers, receive byte-encoded docs created with Doc.to_bytes, and decode them to Docs with Doc.from_bytes.
The reason for not receiving Docs directly is that Python pickles objects in interprocess communication, but the cost of pickling a Doc is generally very large and significantly impairs performance.
Types of change
Enhancement
Checklist