
Batch processing doesn't speed up? #4935

Closed
lingvisa opened this issue Jan 22, 2020 · 7 comments
Labels
lang / zh (Chinese language data and models) · perf / speed (Performance: speed)

Comments

@lingvisa

lingvisa commented Jan 22, 2020

I loaded 300,000 lines of Chinese text, where each line is roughly tweet length. I then compared the batch and non-batch modes to see the speed difference:

Batch:

def process_text_with_batch(text_lines):
    # Batch mode: stream all texts through nlp.pipe()
    line_number = 0
    print("Start to process: ")
    for doc in nlp.pipe(text_lines):
        line_number += 1
        print(line_number)

Non batch:

def process_text(text_lines):
    # Non-batch mode: call nlp() on each text individually
    line_number = 0
    print("Start to process: ")
    for text in text_lines:
        doc = nlp(text)
        line_number += 1
        print(line_number)

Here 'text_lines' is a list of strings. The two approaches finish in almost the same time: 2.5 and 2.6 minutes. Why doesn't batching speed things up? spaCy is using Jieba for segmentation. Is it possible that Jieba doesn't benefit from batching?

@svlandeg svlandeg added lang / zh Chinese language data and models perf / speed Performance: speed labels Jan 23, 2020
@adrianeboyd
Contributor

Hi, pipe() improves the speed for the statistical models like the tagger or parser, but spacy doesn't try to do any batching for the tokenization step, so you won't see any difference here.

This design makes sense for languages that use spacy's built-in rule-based tokenizer, but might become more of a bottleneck for languages that use slower external tokenizers / word segmenters. We'll have to keep this in mind for the future, but for now, just pipe() on its own isn't going to be faster for spacy's default Chinese pipeline.

I would suggest trying multiprocessing with pipe(n_process=N) (n_process=-1 if you want to set it to multiprocessing.cpu_count()). There's a bit of overhead to multiprocessing so it'll be slower for smaller numbers of documents, but I think it should help in your case.
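For reference, here is a minimal sketch of that suggestion applied to the loop above (assuming spaCy v2.2+ and the same text_lines list as before; the batch_size value is an arbitrary choice):

from spacy.lang.zh import Chinese

nlp = Chinese()  # default Chinese pipeline, which segments words with Jieba

def process_text_with_multiprocessing(text_lines):
    line_number = 0
    print("Start to process: ")
    # n_process=-1 uses multiprocessing.cpu_count() worker processes;
    # batch_size controls how many texts each worker receives per batch.
    for doc in nlp.pipe(text_lines, n_process=-1, batch_size=1000):
        line_number += 1
    print("Processed", line_number, "lines")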

@lingvisa
Author

Hi, Adrian:

It does indeed speed things up a lot. However, it fails when I add a custom sentence segmenter to the pipeline:

from spacy.lang.zh import Chinese
from spacy.pipeline import Sentencizer

nlp = Chinese()
sentencizer = Sentencizer(punct_chars=["。", "!", "?", ";", "!", "?"])
nlp.add_pipe(sentencizer)
...
for doc in nlp.pipe(text_lines, n_process=-1):
    print(doc)

If I leave out the nlp.add_pipe() call, it runs fine; otherwise, it reports this error:

Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/cong/spacy/language.py", line 1124, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "/Users/congmin/spacy/language.py", line 1124, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "pipes.pyx", line 1481, in pipe
  File "pipes.pyx", line 1498, in spacy.pipeline.pipes.Sentencizer.predict
IndexError: list assignment index out of range

What might cause this?

@adrianeboyd
Contributor

That looks like a bug, I'll take a closer look tomorrow.

@lingvisa
Author

Hi, Adrian: Just a reminder that it doesn't fail immediately. It fails only with certain amounts of data or on certain sentences. I tested one file with 70,000 sentences and it ran fine. On another file it reported the error after processing fewer than 3,000 sentences. It might be caused by the concurrent processing or by something else. When I disabled the custom segmenter, everything ran fine.

@adrianeboyd
Contributor

The bug is related to processing an empty text/doc. See #4935, which should fix it.
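Until the fix is in a release, one possible stopgap (my own sketch, not part of the linked fix) is to drop empty or whitespace-only lines before they reach the Sentencizer:

# Skip texts that are empty after stripping whitespace, since an empty
# doc is what triggers the IndexError in Sentencizer.predict here.
non_empty_lines = [line for line in text_lines if line.strip()]
for doc in nlp.pipe(non_empty_lines, n_process=-1):
    print(doc)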

@lingvisa
Author

It worked, thanks.

@lock

lock bot commented Feb 23, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Feb 23, 2020