
Need clarity on which component in default pipeline modifies lemma_ on Doc and need suggestions for improving spacy throughput #2678

Closed
vvivek921 opened this issue Aug 16, 2018 · 6 comments
Labels
usage General spaCy usage

Comments

@vvivek921

2 questions:

  1. I am using ' '.join([word.lemma_ for word in nlp(word_chunk)]) to lemmatise a word chunk, and I want to reduce the latency of the nlp() call. I am considering disabling some components of the nlp pipeline, but this doc doesn't mention which component sets the Doc[i].lemma_ attribute, so I can't tell which components to retain and which to discard.

  2. We are using PySpark and spaCy to process large amounts of text data in parallel, but we are still not happy with the throughput. I am trying to reduce spaCy call latencies to improve it. What else do you suggest? Should I explore using nlp.pipe instead of nlp in our PySpark cluster? Or should I consider running PySpark with a GPU, since spaCy is a pipeline of neural network predictions?

Detailed elaboration of point 1 (optional read)
I will start by giving some background on the problem I am trying to solve.
We are using spacy to process large amounts of text data to extract insights from it.
We are using spacy with its default pipeline for multiple tasks:

  1. identifying sentence boundaries in a text using Doc.sents
  2. given a word chunk, finding its lemmas using ' '.join([word.lemma_ for word in SpacyUtils.nlp(word_chunk)])
  3. Doc.noun_chunks to get the noun chunks in a sentence
  4. Doc[i].tag_ to get the POS tag
  5. Doc[i].dep_ to get the dependency label

For each of the 5 tasks above, I am using spaCy with the default pipeline. I want to alter the pipeline per task to reduce the time each one takes.

This document neatly summarises which components of the default pipeline create which attributes on the Doc object.
My strategy is to retain only the parts of the pipeline that are relevant for each task.

For tasks 1, 3, and 5, which require sents, noun_chunks, and dep, I am retaining the tagger and parser and discarding the ner component.
For task 4, I am retaining the tagger alone.
For task 2, which is the most frequently used, I cannot decide which components to retain, because it's not clearly documented which component sets the Doc[i].lemma_ attribute.

@vvivek921
Author

Any pointers, please?

@thomasopsomer
Contributor

Hey,
For point 1, you should keep just the tagger: the lemmatizer benefits from knowing the POS tags. So if you don't need entities or dependencies, you can remove the ner and parser components.
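In code, that suggestion might look like the sketch below. The function name lemmatize is mine, the nlp object is passed in so the same helper works with any loaded pipeline, and en_core_web_sm is just an example model name (it must be installed for the commented lines to run):

```python
def lemmatize(text, nlp):
    # Join the lemma_ of every token in the processed text.
    return " ".join(tok.lemma_ for tok in nlp(text))

# With a real model (assumes spaCy and en_core_web_sm are installed):
# import spacy
# nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# lemmatize("the cats were running", nlp)
```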

For point 2, in my experience the best approach is to use Spark's mapPartitions method: load the nlp object once inside each partition and run nlp.pipe() over the partition's records. You can prevent each nlp object from spawning many threads by setting the environment variable OPENBLAS_NUM_THREADS=1. This is not perfect, because the model has to be reloaded for each partition, which takes some time, but with large partitions it is still worth it.
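A minimal sketch of that mapPartitions pattern. The function name lemmatize_partition and the loader-injection are mine, not spaCy's or Spark's API; the driver-side lines assume an existing RDD of strings named rdd and an installed en_core_web_sm model:

```python
from functools import partial

def lemmatize_partition(texts, load_nlp):
    # Runs once per Spark partition: pay the model-loading cost a single
    # time, then stream every record in the partition through nlp.pipe().
    nlp = load_nlp()
    for doc in nlp.pipe(texts):
        yield " ".join(tok.lemma_ for tok in doc)

# Driver side (sketch; in practice set OPENBLAS_NUM_THREADS=1 via
# spark.executorEnv so the executors actually see it):
# import spacy
# loader = lambda: spacy.load("en_core_web_sm", disable=["parser", "ner"])
# lemmas = rdd.mapPartitions(partial(lemmatize_partition, load_nlp=loader))
```

Injecting the loader (rather than loading at module import time) keeps the expensive spacy.load call inside the partition, which is the whole point of the pattern.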

@vvivek921
Author

I would have to read up on a few things to better understand your reply to point 2. Let me get back to you.

@arvind-ravichandran

@thomasopsomer Is it possible to share the spaCy model across Python workers within a worker node/executor, thereby reducing its memory footprint?

@honnibal
Member

honnibal commented Aug 22, 2018

@vvivek921 Sorry for the delay replying.

I'll answer your specific questions first, but from your elaboration I also think you're probably approaching this the wrong way.

  1. For just the lemmas, you need the tagger, but not the parser or NER. So the following should work: nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

  2. Try not to do something like spark.map(nlp, lots_of_text), which deserializes the nlp object on each call. Instead, ask Spark to do as little as possible. Something like this:

import functools
import spacy

def write_lemmas(model_path, input_dir, output_dir, filename):
    # Load the model, read the texts, write the output
    nlp = spacy.load(model_path)
    texts = read_texts(input_dir, filename)
    for doc in nlp.pipe(texts):
        write_output(doc, output_dir, filename)

spark.map(functools.partial(write_lemmas, model_path, input_dir, output_dir), filenames)

This just uses Spark to execute your work on lots of workers, but gives you control over how the work is batched up.

An alternative approach is to figure out how to get your workers to pre-load the NLP object. Then you can pipe text to and from them. However, you should usually be able to organise the work so that this isn't really necessary.

Loading the NLP object takes like 30s when you have vectors (we'd like to reduce this). If you can batch up your work so that each chunk takes 20 minutes or more, the loading overhead becomes insignificant.
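The arithmetic behind that rule of thumb, using the numbers above:

```python
load_seconds = 30         # one-off cost of loading the nlp object
chunk_seconds = 20 * 60   # useful work per chunk, as suggested above

# Fraction of each chunk's wall time spent loading the model.
overhead = load_seconds / (load_seconds + chunk_seconds)
print(f"loading is {overhead:.1%} of each chunk's wall time")  # about 2.4%
```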

> For each of the 5 tasks above, I am using spacy with default pipeline. I want to alter the pipeline for each of the tasks above to reduce time taken for each of the tasks.

If this means parsing documents multiple times, you might be worse off doing this. It might be better just to parse the text once, and get what you want from it.
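A sketch of that "parse once" idea: run the full default pipeline a single time and pull all five attributes from the same Doc. The function name extract_all is mine; the commented lines assume an installed en_core_web_sm model:

```python
def extract_all(doc):
    # One parsed Doc serves all five tasks from the question.
    return {
        "sentences":   [sent.text for sent in doc.sents],          # task 1
        "lemmas":      " ".join(tok.lemma_ for tok in doc),        # task 2
        "noun_chunks": [chunk.text for chunk in doc.noun_chunks],  # task 3
        "tags":        [(tok.text, tok.tag_) for tok in doc],      # task 4
        "deps":        [(tok.text, tok.dep_, tok.head.text) for tok in doc],  # task 5
    }

# With a real model:
# import spacy
# nlp = spacy.load("en_core_web_sm")   # full default pipeline, loaded once
# results = extract_all(nlp("First sentence. Second sentence."))
```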

@honnibal honnibal added the usage General spaCy usage label Aug 26, 2018
@lock

lock bot commented Sep 25, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 25, 2018