💫 Try multi-processing in v2 nlp.pipe()? #1303

Closed
honnibal opened this issue Sep 6, 2017 · 10 comments
Labels
- enhancement: Feature requests and improvements
- help wanted (easy): Contributions welcome! (also suited for spaCy beginners)
- help wanted: Contributions welcome!
- scaling: Scaling, serving and parallelizing spaCy

Comments

@honnibal
Member

honnibal commented Sep 6, 2017

In spaCy v1, multi-processing was a non-starter for a variety of reasons: the model took a long time to load, and the integer ID mapping was stateful. Both issues have been fixed in v2. At the same time, the v2 neural network model can't yet release the GIL, which makes multi-threading inefficient. We should therefore consider whether multi-processing would be a better solution.

The nlp.pipe() method is already a generator that takes a batch_size argument. I think it should be pretty easy to try out multi-processing here.
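Since nlp.pipe() is a generator with a batch_size argument, any multiprocessing variant would first need to carve the incoming text stream into work units. A minimal sketch of such a batching helper (the name minibatch is hypothetical here, not spaCy's API):

```python
from itertools import islice

def minibatch(items, batch_size=1000):
    """Yield successive lists of up to batch_size items from an iterable.

    A multiprocessing pipe() would need something like this to split
    the lazy text stream into batches to hand to worker processes.
    """
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

For example, `list(minibatch(range(5), 2))` yields the batches `[0, 1]`, `[2, 3]`, `[4]`, preserving input order.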

@honnibal honnibal added enhancement Feature requests and improvements help wanted (easy) Contributions welcome! (also suited for spaCy beginners) 🌙 nightly Discussion and contributions related to nightly builds labels Sep 6, 2017
@souravsingh
Contributor

@honnibal I am interested in working on the issue.

@honnibal
Member Author

@souravsingh Great! Here's the method that would need to change:

https://github.com/explosion/spaCy/blob/develop/spacy/language.py#L433

I would suggest first working on getting the empty pipeline working (i.e. just the tokenizer). Then you can try the models.

The main complication you might encounter is that the v2 models use numpy, which multi-threads the matrix multiplications via OpenBLAS. I'm not sure whether this will cause trouble in child processes. I also don't know whether the GPU will complain in child processes or not.
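Following the suggestion to start with an empty pipeline, a rough shape of the idea using only the standard library, with a trivial whitespace split standing in for spaCy's tokenizer (parallel_pipe is a hypothetical name, and this assumes a fork start method; with a real model, the OpenBLAS threading concern above would still need checking in the children):

```python
import multiprocessing as mp

def tokenize(text):
    # Stand-in for a real tokenizer: whitespace split only.
    return text.split()

def parallel_pipe(texts, n_process=2, chunksize=8):
    """Sketch of a multiprocessing pipe() over an 'empty' pipeline.

    Each child process tokenizes its share of the texts; imap keeps
    the results in input order, like a generator-based pipe() would.
    """
    with mp.Pool(processes=n_process) as pool:
        yield from pool.imap(tokenize, texts, chunksize)
```

Because pool.imap is lazy and order-preserving, the caller can still consume results as a stream, which is the contract nlp.pipe() already has.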

@souravsingh
Contributor

@honnibal Are we free to use joblib instead of multiprocessing?

@honnibal
Member Author

honnibal commented Oct 2, 2017

@souravsingh Yes, I like joblib.

@ines ines added the help wanted Contributions welcome! label Oct 13, 2017
@ines ines removed the 🌙 nightly Discussion and contributions related to nightly builds label Nov 9, 2017
@ned2

ned2 commented Jun 1, 2018

In case this is helpful, I've had success getting multiprocessing to work with spaCy by using the multiprocessing module from the pathos package as a drop-in replacement for the standard library's multiprocessing module. In addition to other enhancements, it uses dill for pickling (which I assume is what makes the difference).
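The reason swapping in dill helps: the standard library's multiprocessing serializes work functions with pickle to send them to child processes, and pickle cannot handle lambdas, closures, or nested functions. A quick demonstration of the limitation that pathos/dill works around (the exception type raised varies by Python version, so several are caught):

```python
import pickle

# Stdlib multiprocessing pickles the work function before dispatching
# it to a child process, so a lambda fails before any work happens.
try:
    pickle.dumps(lambda text: text.lower())
    picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    picklable = False

# picklable is False: this is the gap dill (used by pathos) fills.
```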

@phdowling

Hey, what's the status here? Is anyone working on this?

@jcw780

jcw780 commented Aug 26, 2019

Just out of curiosity, what is stopping you from releasing the GIL?
P.S.
Oddly enough, pre-2.1.4 versions of spaCy did appear to use multiple threads if you checked the terminal. (2.1.4 does not, though this doesn't seem to affect runtime.)

@teoh

teoh commented Sep 30, 2019

@souravsingh Great! Here's the method that would need to change:
https://github.com/explosion/spaCy/blob/develop/spacy/language.py#L433

@honnibal just wanted to check that this is still the right method to change (since this link is from two years back). I'm interested in picking this up, since it seems it hasn't been completed yet.

@honnibal
Member Author

honnibal commented Oct 3, 2019

@teoh In case you're still thinking about this, have a look at #4371

@svlandeg svlandeg added the scaling Scaling, serving and parallelizing spaCy label Oct 3, 2019
honnibal pushed a commit that referenced this issue Oct 8, 2019
* refactor: separate formatting docs and golds in Language.update

* fix return typo

* add pipe test

* unpickleable object cannot be assigned to p.map

* passed test pipe

* passed test!

* pipe terminate

* try pipe

* passed test

* fix ch

* add comments

* fix len(texts)

* add comment

* add comment

* fix: multiprocessing of pipe is not supported in 2

* test: use assert_docs_equal

* fix: is_python3 -> is_python2

* fix: change _pipe arg to use functools.partial

* test: add vector modification test

* test: add sample ner_pipe and user_data pipe

* add warnings test

* test: fix user warnings

* test: fix warnings capture

* fix: remove islice import

* test: remove warnings test

* test: add stream test

* test: rename

* fix: multiproc stream

* fix: stream pipe

* add comment

* mp.Pipe seems to be able to use with relative small data

* test: skip stream test in python2

* sort imports

* test: add reason to skiptest

* fix: use pipe for docs communication

* add comments

* add comment
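Several of the commit messages above mention using multiprocessing.Pipe for communicating docs between processes. A minimal, self-contained sketch of that pattern (the worker's uppercasing is a stand-in for running the actual pipeline, and this assumes a fork start method):

```python
import multiprocessing as mp

def worker(conn):
    # Child process: read texts until the None sentinel arrives,
    # "process" each one, and send the result back over the pipe.
    while True:
        text = conn.recv()
        if text is None:
            break
        conn.send(text.upper())  # stand-in for running the pipeline
    conn.close()

def run(texts):
    """Stream texts to a child over a duplex Pipe and collect results."""
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=worker, args=(child_conn,))
    proc.start()
    results = []
    for text in texts:
        parent_conn.send(text)
        results.append(parent_conn.recv())
    parent_conn.send(None)  # tell the child to shut down
    proc.join()
    return results
```

As one commit notes, mp.Pipe works well for relatively small payloads like this; larger objects pay a serialization cost on every send/recv.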
@ines ines closed this as completed Oct 18, 2019
@lock

lock bot commented Nov 17, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 17, 2019
8 participants