multithreading for Corpus creation #277
Here I found an example of how that should work with spaCy; will explore and report back: https://github.com/explosion/spaCy/blob/master/examples/pipeline/multi_processing.py
So here is an example of running the corpus creation in multiple processes vs. a single process. Unfortunately, the multiprocessing version is even slower: 04:23 minutes for Corpus(12000 docs, 1776000 tokens) vs. 03:29 for a single process (on a 2-core/4-thread laptop). Is there something that could be done to actually profit from multiprocessing with textacy?
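The processes-vs-single-process comparison can be sketched without spaCy at all; `fake_parse` below is a stand-in for the per-document parse, not the real pipeline. The sketch illustrates why a pool can lose: when per-document work is cheap, the cost of pickling inputs and results between processes dominates.

```python
import time
from multiprocessing import Pool

def fake_parse(text):
    # Stand-in for nlp(text): cheap per-document work, so pickling
    # inputs/outputs between processes can easily outweigh the savings.
    return text.split()

def run_serial(texts):
    return [fake_parse(t) for t in texts]

def run_pool(texts, n_jobs=2):
    # chunksize batches the IPC so each round-trip carries many texts
    with Pool(n_jobs) as pool:
        return pool.map(fake_parse, texts, chunksize=100)

if __name__ == "__main__":
    texts = ["some document text %d" % i for i in range(10_000)]
    t0 = time.perf_counter()
    serial = run_serial(texts)
    t1 = time.perf_counter()
    pooled = run_pool(texts)
    t2 = time.perf_counter()
    assert serial == pooled
    print(f"serial: {t1 - t0:.3f}s, pool: {t2 - t1:.3f}s")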
Hi @Motorrat, thanks for the detailed posts, and apologies for the belated reply. spaCy recently (re-)implemented multiprocessing in its core; I'm going to do some work to make use of it.
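The core feature referred to here is the `n_process` argument added to `Language.pipe` in spaCy v2.2.2. A minimal sketch (using a blank pipeline so no model download is needed):

```python
import spacy

# A blank English pipeline keeps this sketch model-free; with a loaded
# model the call is the same.
nlp = spacy.blank("en")
texts = ["first document", "second document", "third document"]

# n_process fans the parsing out over worker processes; batch_size
# controls how many texts are sent to each worker at a time.
docs = list(nlp.pipe(texts, n_process=2, batch_size=50))
print(len(docs))  # 3
```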
Heads-up: #285
The update is now available in a release: https://github.com/chartbeat-labs/textacy/releases/tag/0.10.0
context
I am trying to speed up Corpus creation via parallel processing. The source is a dictionary of text strings, so the code looks roughly like this:
Currently it seems there is some bottleneck that prevents this from utilizing all available processing power. With a certain dataset I get roughly 3000 texts added per minute regardless of how many jobs are running: 1, 4, or 8 (threads). I have tried it on a 2-core laptop and on an AWS 2xlarge instance (4 cores/8 threads) with roughly the same throughput on both.
A dataset of 3 million tokens takes roughly 5 minutes to load into a Corpus. I wonder if this is really bound by processing power or if there is some bottleneck that could be removed.
I have tried a workaround and implemented the same with preprocessing the texts into spacy docs:
Unfortunately, there was no speed-up, so I guess I would need to open an issue against spaCy and see if something could be done there.
proposed solution
Add a parallel-processing option to textacy.corpus.Corpus(data, lang) and change its signature to textacy.corpus.Corpus(lang, data, n_jobs=1).
alternative solutions?
Share the above doc snippets as examples in the documentation.