
Batch processing doesn't speed up? #4935

Closed
lingvisa opened this issue Jan 22, 2020 · 7 comments
Labels
lang / zh (Chinese language data and models) · perf / speed (Performance: speed)

Comments

@lingvisa

lingvisa commented Jan 22, 2020

I loaded 300,000 lines of Chinese text, where each line is roughly tweet length. I then compared the batch and non-batch modes to see the speed difference:

Batch:

def process_text_with_batch(text_lines):
    # Batch mode: stream all texts through nlp.pipe()
    line_number = 0
    print("Start to process: ")
    for doc in nlp.pipe(text_lines):
        line_number += 1
        print(line_number)

Non batch:

def process_text(text_lines):
    # Non-batch mode: call nlp() on each text individually
    line_number = 0
    print("Start to process: ")
    for text in text_lines:
        doc = nlp(text)
        line_number += 1
        print(line_number)

Here 'text_lines' is a list of strings. The two approaches finish in almost the same time: 2.5 and 2.6 minutes. Why doesn't batching speed things up? spaCy is using Jieba for segmentation. Is it possible that Jieba doesn't benefit from batching?

@svlandeg svlandeg added lang / zh Chinese language data and models perf / speed Performance: speed labels Jan 23, 2020
@adrianeboyd
Contributor

Hi, pipe() improves the speed for the statistical models like the tagger or parser, but spacy doesn't try to do any batching for the tokenization step, so you won't see any difference here.

This design makes sense for languages that use spacy's built-in rule-based tokenizer, but might become more of a bottleneck for languages that use slower external tokenizers / word segmenters. We'll have to keep this in mind for the future, but for now, just pipe() on its own isn't going to be faster for spacy's default Chinese pipeline.

I would suggest trying multiprocessing with pipe(n_process=N) (n_process=-1 if you want to set it to multiprocessing.cpu_count()). There's a bit of overhead to multiprocessing so it'll be slower for smaller numbers of documents, but I think it should help in your case.
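For reference, here is a minimal sketch of that suggestion applied to the loop above (assuming spaCy v2.2+ and the same text_lines list as before; the batch_size value is an arbitrary choice):

from spacy.lang.zh import Chinese

nlp = Chinese()  # default Chinese pipeline, which segments words with Jieba

def process_text_with_multiprocessing(text_lines):
    line_number = 0
    print("Start to process: ")
    # n_process=-1 uses multiprocessing.cpu_count() worker processes;
    # batch_size controls how many texts each worker receives per batch.
    for doc in nlp.pipe(text_lines, n_process=-1, batch_size=1000):
        line_number += 1
    print("Processed", line_number, "lines")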

@lingvisa
Author

Hi, Adrian:

It does indeed speed things up a lot. However, it fails when I add a custom sentence segmenter to the pipeline:

from spacy.lang.zh import Chinese
from spacy.pipeline import Sentencizer

nlp = Chinese()
sentencizer = Sentencizer(punct_chars=["。", "!", "?", ";", "!", "?"])
nlp.add_pipe(sentencizer)
...
for doc in nlp.pipe(text_lines, n_process=-1):
    print(doc)

If I leave out the nlp.add_pipe() call, it runs fine; otherwise, it reports this error:

Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/cong/spacy/language.py", line 1124, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "/Users/congmin/spacy/language.py", line 1124, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "pipes.pyx", line 1481, in pipe
  File "pipes.pyx", line 1498, in spacy.pipeline.pipes.Sentencizer.predict
IndexError: list assignment index out of range

What might cause this?

@adrianeboyd
Contributor

That looks like a bug, I'll take a closer look tomorrow.

@lingvisa
Author

Hi, Adrian: Just a reminder that it doesn't fail immediately. It fails only with certain amounts of data or on certain sentences. I tested one file with 70,000 sentences and it ran fine. On another file it reported the error after processing fewer than 3,000 sentences. It might be caused by the concurrent processing or by something else. When I disabled the custom segmenter, everything ran fine.

@adrianeboyd
Contributor

The bug is related to processing an empty text/doc. See #4935, which should fix it.
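Until the fix is in a release, one possible stopgap (my own sketch, not part of the linked fix) is to drop empty or whitespace-only lines before they reach the Sentencizer:

# Skip texts that are empty after stripping whitespace, since an empty
# doc is what triggers the IndexError in Sentencizer.predict here.
non_empty_lines = [line for line in text_lines if line.strip()]
for doc in nlp.pipe(non_empty_lines, n_process=-1):
    print(doc)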

@lingvisa
Author

It worked, thanks.

@lock

lock bot commented Feb 23, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Feb 23, 2020