
multiprocessing pipe (#1303) #4371

Merged: 41 commits into explosion:master on Oct 8, 2019

Conversation

@tamuhey (Contributor) commented Oct 3, 2019

from #1303
Implement multiprocessing in nlp.pipe. You can easily parallelize processing by passing the n_process argument, as follows:

nlp.pipe(texts, n_process=2)

The following link is a notebook with a simple measurement of the execution time:

https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db
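
For example, a minimal end-to-end usage sketch (the en_core_web_sm model and the example texts are assumptions; any pipeline works):

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["This is the first document.", "This is the second one."] * 500

# n_process=2 runs the pipeline in two worker processes;
# n_process=1 (the default) behaves exactly as before.
docs = list(nlp.pipe(texts, n_process=2))
print(len(docs), docs[0].text)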

Description

Modifications

  • Modify Language.pipe:
    • Add an n_process argument.
    • Parallelize processing if n_process != 1.
    • If n_process == 1, it works as before.
  • Add an _apply_pipes function in language.py. This is the worker for multiprocessing (see the sketch after this list).
  • Add a simple test in test_language.py.
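
A minimal sketch of the worker pattern, for illustration only (not the actual _apply_pipes implementation; the pipe wiring and the loaded nlp object are assumptions):

from multiprocessing import Pipe, Process

def apply_pipes(nlp, receiver, sender):
    # Child-process loop: receive a batch of raw texts, run the full
    # pipeline, and send back byte-encoded Docs (cheap to transfer).
    while True:
        texts = receiver.recv()
        sender.send([doc.to_bytes() for doc in nlp.pipe(texts)])

# Parent side: one pair of one-way pipes per worker. With the fork
# start method, nlp is inherited by the child rather than pickled.
recv_texts, send_texts = Pipe(duplex=False)
recv_docs, send_docs = Pipe(duplex=False)
proc = Process(target=apply_pipes, args=(nlp, recv_texts, send_docs), daemon=True)
proc.start()
send_texts.send(["One text.", "Another text."])
encoded = recv_docs.recv()
proc.terminate()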

Implementation overview

Batches of texts (str) are sent to the workers; the workers send back byte-encoded docs created with Doc.to_bytes, and the parent decodes them back into Docs with Doc.from_bytes.
The reason for not sending Docs directly is that Python pickles objects for interprocess communication, and the cost of pickling a Doc is generally very large, which significantly impairs performance.
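
For reference, the serialization round trip looks roughly like this (a sketch; the model name is an assumption):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

doc = nlp("Workers encode each Doc to bytes.")
data = doc.to_bytes()                        # compact byte encoding in the worker
restored = Doc(nlp.vocab).from_bytes(data)   # decoded in the parent process
assert restored.text == doc.text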

Types of change

Enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@honnibal (Member) commented Oct 3, 2019

I've wanted functionality like this for some time, so this is definitely cool. A couple of comments:

  • We should really use the new DocBin class to do the pickling (a round-trip sketch follows this list).
  • We should make sure to test with the md and lg models, not just sm. If the models have vectors, it can change the runtime implications quite a lot.
  • It's a larger change, but it would be great if we could get the vectors into shared memory, so that we don't have to keep multiple copies of them.
  • The runtime will be really ugly if people pass a small batch and use multiple processes. Should we try to second-guess this? I'm thinking probably not.
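
For reference, a minimal DocBin round trip might look like this (a sketch against the spaCy 2.2 API; store_user_data=True also carries Doc.user_data, which is where the Underscore extensions keep their values):

from spacy.tokens import DocBin

doc_bin = DocBin(store_user_data=True)     # keep Doc.user_data as well
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
data = doc_bin.to_bytes()                  # one payload for the whole batch

# Receiving side: restore all Docs against a shared vocab.
docs = list(DocBin(store_user_data=True).from_bytes(data).get_docs(nlp.vocab))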

@svlandeg added the labels enhancement (Feature requests and improvements), feat / pipeline (Feature: Processing pipeline and components) and scaling (Scaling, serving and parallelizing spaCy) on Oct 3, 2019
@tamuhey (Contributor, Author) commented Oct 4, 2019

@honnibal Thanks for your comment!

We should really use the new DocBin class to do the pickling.

Can a Doc be completely restored from DocBin?
I didn't know how to restore the Underscore data from DocBin, so I didn't use it.

We should make sure to test with the md and lg models, not just sm. If the models have vectors, it can change the runtime implications quite a lot.

Thanks, I hadn't thought about the vectors.
I've updated the notebook to check that the md and lg models output the same docs as a single process:

https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db

It's a larger change, but it would be great if we could get the vectors into shared memory, so that we don't have to keep multiple copies of them.

I think so too, though it will be difficult.

The runtime will be really ugly if people pass a small batch and use multiple processes. Should we try to second-guess this? I'm thinking probably not.

What do "the runtime" and "second-guess" mean?

@tamuhey (Contributor, Author) commented Oct 5, 2019

It seems that the pipe test fails on Python 2.7.

@tamuhey (Contributor, Author) commented Oct 5, 2019

@honnibal I don't want to support Python 2.7; what do you think?
If users set n_process > 1, we could show a warning and reset n_process=1 (see the sketch below).
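
As a sketch, the guard inside Language.pipe could look something like this (n_process here stands for the method's argument; the exact warning text is an assumption):

import sys
import warnings

# On Python 2, fall back to single-process mode with a warning.
if sys.version_info[0] < 3 and n_process != 1:
    warnings.warn("n_process > 1 is not supported on Python 2; using n_process=1")
    n_process = 1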

@honnibal (Member) commented Oct 5, 2019

@honnibal I don't want to support Python 2.7; what do you think?
If users set n_process > 1, we could show a warning and reset n_process=1.

That's a great solution, go right ahead. If 3.5 gives you trouble you can do the same there.

@honnibal (Member) commented Oct 7, 2019

@tamuhey Sorry for not being clear:

What do "the runtime" and "second-guess" mean?

I mean that the program might be slow if you ask it to process a small amount of text with multiple processes. For instance, with 2 processes and only 100 documents, I think the program will be much slower than with 1 process and 100 documents.

The library could try to figure this out, and avoid introducing extra processes if they won't be helpful. This would mean n_process would be treated as something like a "maximum number of processes".

The advantage of this is that I think a majority of users would benefit from it, as people expect that adding more resources always improves performance. The disadvantage is that we're trying to be clever, rather than just doing what the user tells us to do.

@tamuhey (Contributor, Author) commented Oct 7, 2019

@honnibal Thanks, I got it.
The main challenge is that pipe doesn't know the number of input texts up front (texts is an iterator, which can have infinite length).
However, the data sent to the workers is a list made with the minibatch function, so it can be counted.
So a possible measure is to reduce the number of processes when a batch falls below about 1000 texts per core (see the sketch below).
I'll work on it.
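
In pseudo-form, the heuristic could be something like this (a sketch; batch is assumed to be the list produced by minibatch, and the 1000-texts threshold is illustrative):

# Don't spawn more workers than the batch size justifies, assuming
# roughly 1000 texts per process is the break-even point.
n_process = min(n_process, max(1, len(batch) // 1000))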

@honnibal (Member) commented Oct 7, 2019

@tamuhey Let's try not to be too clever at first, and just do what we're told. We can always add logic to be "smarter" later.

@tamuhey (Contributor, Author) commented Oct 7, 2019

@honnibal OK, I'll try that in another PR.

@honnibal (Member) commented Oct 7, 2019

Nice to see it go green! I want to try it out a bit before merging, but in theory it looks good.

@tamuhey (Contributor, Author) commented Oct 7, 2019

@honnibal
Please use this notebook!
https://gist.github.com/tamuhey/fce0d74ee129681fa13828d8872414db

@honnibal honnibal merged commit 650cbfe into explosion:master Oct 8, 2019
@Motorrat commented Nov 7, 2019

I'm coming here following a reference from textacy (chartbeat-labs/textacy#277).
I've just downloaded spaCy 2.2.2 and tried out the multiprocessing pipe along the lines suggested by @tamuhey:

number of docs 10000

started 4 process 2019-11-07 13:28:38
finished 4 process 2019-11-07 13:31:31

multiprocessing time 0:02:53.662849

started 1 process 2019-11-07 13:31:31
finished 1 process 2019-11-07 13:34:49

single process time 0:03:17.58610

As you can see, there is not much of a speed advantage. I wonder why?

The system monitor (Ubuntu 18.04 64-bit, 2-core/4-thread CPU) shows all four threads busy at 100% CPU with n_process=4, vs. one thread at 25% CPU with n_process=1.

from datetime import datetime
import spacy

'''
cd ~/venv
virtualenv spacy22 -p /usr/bin/python3.6
source spacy22/bin/activate
CFLAGS="-Wno-narrowing" pip install cld2-cffi
pip install numpy==1.17.2
pip install spacy
python -m spacy download en_core_web_lg
'''

text_string='''multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

The multiprocessing module also introduces APIs which do not have analogs in the threading module. A prime example of this is the Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism). The following example demonstrates the common practice of defining such functions in a module so that child processes can successfully import that module.'''

# Create a list of strings in which all tokens are unique, so the vocab
# will also be large.
texts = []
for i in range(10000):
    texts.append(' '.join(str(i) + word for word in text_string.split()))

print(texts[42])

print('number of docs', len(texts))

model = spacy.load('en_core_web_lg')

start = datetime.now()
print('started 4 process', start.strftime('%Y-%m-%d %H:%M:%S'))
docs_multi = list(model.pipe(texts, n_process=4))
finish = datetime.now()
print('finished 4 process', finish.strftime('%Y-%m-%d %H:%M:%S'))
print('##### multiprocessing time', finish - start)

start = datetime.now()
print('started 1 process', start.strftime('%Y-%m-%d %H:%M:%S'))
docs_single = list(model.pipe(texts, n_process=1))
finish = datetime.now()
print('finished 1 process', finish.strftime('%Y-%m-%d %H:%M:%S'))
print('##### single process time', finish - start)
