
multiprocessing pipe (#1303) #4371

Merged: 41 commits into explosion:master on Oct 8, 2019

Conversation

@tamuhey (Contributor) commented Oct 3, 2019

from #1303
Implement multiprocessing in nlp.pipe. You can easily parallelize processing by passing the n_process argument, as follows:

nlp.pipe(texts, n_process=2)

The following link is a notebook with a simple measurement of the execution time:

https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db
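
For example, a minimal end-to-end usage sketch (the en_core_web_sm model and the example texts are assumptions; any pipeline works):

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["This is the first document.", "This is the second one."] * 500

# n_process=2 runs the pipeline in two worker processes;
# n_process=1 (the default) behaves exactly as before.
docs = list(nlp.pipe(texts, n_process=2))
print(len(docs), docs[0].text)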

Description

Modifications

  • Modify Language.pipe:
    • Add an n_process argument.
    • Parallelize processing if n_process != 1.
    • If n_process == 1, it works as before.
  • Add an _apply_pipes function in language.py. This is the worker for multiprocessing (see the sketch after this list).
  • Add a simple test in test_language.py.
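
A minimal sketch of the worker pattern, for illustration only (not the actual _apply_pipes implementation; the pipe wiring and the loaded nlp object are assumptions):

from multiprocessing import Pipe, Process

def apply_pipes(nlp, receiver, sender):
    # Child-process loop: receive a batch of raw texts, run the full
    # pipeline, and send back byte-encoded Docs (cheap to transfer).
    while True:
        texts = receiver.recv()
        sender.send([doc.to_bytes() for doc in nlp.pipe(texts)])

# Parent side: one pair of one-way pipes per worker. With the fork
# start method, nlp is inherited by the child rather than pickled.
recv_texts, send_texts = Pipe(duplex=False)
recv_docs, send_docs = Pipe(duplex=False)
proc = Process(target=apply_pipes, args=(nlp, recv_texts, send_docs), daemon=True)
proc.start()
send_texts.send(["One text.", "Another text."])
encoded = recv_docs.recv()
proc.terminate()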

Implementation overview

Batches of texts (str) are sent to the workers; the workers send back byte-encoded docs created with Doc.to_bytes, and the parent decodes them back into Docs with Doc.from_bytes.
The reason for not sending Docs directly is that Python pickles objects for interprocess communication, and the cost of pickling a Doc is generally very large, which significantly impairs performance.
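
For reference, the serialization round trip looks roughly like this (a sketch; the model name is an assumption):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

doc = nlp("Workers encode each Doc to bytes.")
data = doc.to_bytes()                        # compact byte encoding in the worker
restored = Doc(nlp.vocab).from_bytes(data)   # decoded in the parent process
assert restored.text == doc.text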

Types of change

Enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@honnibal (Member) commented Oct 3, 2019

I've wanted functionality like this for some time, so this is definitely cool. A couple of comments:

  • We should really use the new DocBin class to do the pickling (a round-trip sketch follows this list).
  • We should make sure to test with the md and lg models, not just sm. If the models have vectors, it can change the runtime implications quite a lot.
  • It's a larger change, but it would be great if we could get the vectors into shared memory, so that we don't have to keep multiple copies of them.
  • The runtime will be really ugly if people pass a small batch and use multiple processes. Should we try to second-guess this? I'm thinking probably not.
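
For reference, a minimal DocBin round trip might look like this (a sketch against the spaCy 2.2 API; store_user_data=True also carries Doc.user_data, which is where the Underscore extensions keep their values):

from spacy.tokens import DocBin

doc_bin = DocBin(store_user_data=True)     # keep Doc.user_data as well
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
data = doc_bin.to_bytes()                  # one payload for the whole batch

# Receiving side: restore all Docs against a shared vocab.
docs = list(DocBin(store_user_data=True).from_bytes(data).get_docs(nlp.vocab))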

@svlandeg added the labels enhancement (Feature requests and improvements), feat / pipeline (Feature: Processing pipeline and components) and scaling (Scaling, serving and parallelizing spaCy) on Oct 3, 2019
@tamuhey (Contributor, Author) commented Oct 4, 2019

@honnibal Thanks for your comment!

We should really use the new DocBin class to do the pickling.

Can a Doc be completely restored from DocBin?
I didn't know how to restore the Underscore data from DocBin, so I didn't use it.

We should make sure to test with the md and lg models, not just sm. If the models have vectors, it can change the runtime implications quite a lot.

Thanks, I hadn't thought about the vectors.
I've updated the notebook to check that the md and lg models output the same docs as a single process:

https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db

It's a larger change, but it would be great if we could get the vectors into shared memory, so that we don't have to keep multiple copies of them.

I think so too, though it will be difficult.

The runtime will be really ugly if people pass a small batch and use multiple processes. Should we try to second-guess this? I'm thinking probably not.

What do "the runtime" and "second-guess" mean?

@tamuhey (Contributor, Author) commented Oct 5, 2019

It seems that the pipe test fails on Python 2.7.

@tamuhey (Contributor, Author) commented Oct 5, 2019

@honnibal I don't want to support Python 2.7; what do you think?
If users set n_process > 1, we could show a warning and reset n_process=1 (see the sketch below).
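
As a sketch, the guard inside Language.pipe could look something like this (n_process here stands for the method's argument; the exact warning text is an assumption):

import sys
import warnings

# On Python 2, fall back to single-process mode with a warning.
if sys.version_info[0] < 3 and n_process != 1:
    warnings.warn("n_process > 1 is not supported on Python 2; using n_process=1")
    n_process = 1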

@honnibal (Member) commented Oct 5, 2019

@honnibal I don't want to support Python 2.7; what do you think?
If users set n_process > 1, we could show a warning and reset n_process=1.

That's a great solution, go right ahead. If 3.5 gives you trouble you can do the same there.

@honnibal (Member) commented Oct 7, 2019

@tamuhey Sorry for not being clear:

What do "the runtime" and "second-guess" mean?

I mean that the program might be slow if you ask it to process a small amount of text with multiple processes. For instance, with 2 processes and only 100 documents, I think the program will be much slower than with 1 process and 100 documents.

The library could try to figure this out, and avoid introducing extra processes if they won't be helpful. This would mean n_process would be treated as something like a "maximum number of processes".

The advantage of this is that I think a majority of users would benefit from it, as people expect that adding more resources always improves performance. The disadvantage is that we're trying to be clever, rather than just doing what the user tells us to do.

@tamuhey (Contributor, Author) commented Oct 7, 2019

@honnibal Thanks, I got it.
The main challenge is that pipe doesn't know the number of input texts up front (texts is an iterator, which can have infinite length).
However, the data sent to the workers is a list made with the minibatch function, so it can be counted.
So a possible measure is to reduce the number of processes when a batch falls below about 1000 texts per core (see the sketch below).
I'll work on it.
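
In pseudo-form, the heuristic could be something like this (a sketch; batch is assumed to be the list produced by minibatch, and the 1000-texts threshold is illustrative):

# Don't spawn more workers than the batch size justifies, assuming
# roughly 1000 texts per process is the break-even point.
n_process = min(n_process, max(1, len(batch) // 1000))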

@honnibal (Member) commented Oct 7, 2019

@tamuhey Let's try not to be too clever at first, and just do what we're told. We can always add logic to be "smarter" later.

@tamuhey (Contributor, Author) commented Oct 7, 2019

@honnibal OK, I'll try that in another PR.

@honnibal (Member) commented Oct 7, 2019

Nice to see it go green! I want to try it out a bit before merging, but in theory it looks good.

@tamuhey (Contributor, Author) commented Oct 7, 2019

@honnibal
Please use this notebook!
https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db

@honnibal honnibal merged commit 650cbfe into explosion:master Oct 8, 2019
@Motorrat commented Nov 7, 2019

I'm coming here following a reference from textacy (chartbeat-labs/textacy#277).
I've just downloaded spaCy 2.2.2 and tried out the multiprocessing pipe along the lines suggested by @tamuhey:

number of docs 10000

started 4 process 2019-11-07 13:28:38
finished 4 process 2019-11-07 13:31:31

multiprocessing time 0:02:53.662849

started 1 process 2019-11-07 13:31:31
finished 1 process 2019-11-07 13:34:49

single process time 0:03:17.58610

As you can see, there is not much of a speed advantage. I wonder why?

The system monitor (Ubuntu 18.04 64-bit, 2-core/4-thread CPU) shows all four threads busy at 100% CPU with n_process=4, vs. one thread at 25% CPU with n_process=1.

from datetime import datetime
import spacy

'''
cd ~/venv
virtualenv spacy22 -p /usr/bin/python3.6
source spacy22/bin/activate
CFLAGS="-Wno-narrowing" pip install cld2-cffi
pip install numpy==1.17.2
pip install spacy
python -m spacy download en_core_web_lg
'''

text_string='''multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

The multiprocessing module also introduces APIs which do not have analogs in the threading module. A prime example of this is the Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism). The following example demonstrates the common practice of defining such functions in a module so that child processes can successfully import that module.'''

# Create a list of strings in which all tokens are unique, so the vocab
# will also be large.
texts = []
for i in range(10000):
    texts.append(' '.join(str(i) + word for word in text_string.split()))

print(texts[42])

print('number of docs', len(texts))

model = spacy.load('en_core_web_lg')

start = datetime.now()
print('started 4 process', start.strftime('%Y-%m-%d %H:%M:%S'))
docs_multi = list(model.pipe(texts, n_process=4))
finish = datetime.now()
print('finished 4 process', finish.strftime('%Y-%m-%d %H:%M:%S'))
print('##### multiprocessing time', finish - start)

start = datetime.now()
print('started 1 process', start.strftime('%Y-%m-%d %H:%M:%S'))
docs_single = list(model.pipe(texts, n_process=1))
finish = datetime.now()
print('finished 1 process', finish.strftime('%Y-%m-%d %H:%M:%S'))
print('##### single process time', finish - start)
