
Need clarity on which component in default pipeline modifies lemma_ on Doc and need suggestions for improving spacy throughput #2678

Closed
vvivek921 opened this issue Aug 16, 2018 · 6 comments
Labels
usage General spaCy usage

Comments

@vvivek921

2 questions:

  1. I am using ' '.join([word.lemma_ for word in nlp(word_chunk)]) to lemmatise a word chunk, and I want to reduce the latency of the nlp() call. I am considering disabling some components of the nlp pipeline, but this doc doesn't mention which component sets the Doc[i].lemma_ attribute, so I can't tell which components to retain and which to discard.

  2. We are using PySpark and spaCy to process large amounts of text data in parallel, but we are still not happy with the throughput. I am trying to reduce spaCy call latencies to improve it. What else do you suggest? Should I explore using nlp.pipe instead of nlp in our PySpark cluster? Or should I consider running PySpark with a GPU, since spaCy is a pipeline of neural network predictions?

Detailed elaboration of point 1 (optional read)
I will start by giving some background on the problem I am trying to solve.
We are using spacy to process large amounts of text data to extract insights from it.
We are using spacy with its default pipeline for multiple tasks:

  1. identifying sentence boundaries in a text using Doc.sents
  2. given a word chunk, finding its lemmas using ' '.join([word.lemma_ for word in SpacyUtils.nlp(word_chunk)])
  3. Doc.noun_chunks to get the noun chunks in a sentence
  4. Doc[i].tag_ to get the POS tag
  5. Doc[i].dep_ to get the dependency label

For each of the 5 tasks above, I am using spaCy with the default pipeline. I want to alter the pipeline per task to reduce the time each one takes.

This document neatly summarises which components of the default pipeline create which attributes on the Doc object.
My strategy is to retain only the parts of the pipeline that are relevant for each task.

For tasks 1, 3, and 5, which require sents, noun_chunks, and dep, I am retaining the tagger and parser and discarding the ner component.
For task 4, I am retaining the tagger alone.
For task 2, which is the most frequently used, I cannot decide which components to retain, because it's not clearly documented which component sets the Doc[i].lemma_ attribute.

@vvivek921
Author

Any pointers, please?

@thomasopsomer
Contributor

Hey,
For point 1, you should keep just the tagger: the lemmatizer benefits from knowing the POS tags. So if you don't need entities or dependencies, you can remove the ner and parser components.
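In code, that suggestion might look like the sketch below. The function name lemmatize is mine, the nlp object is passed in so the same helper works with any loaded pipeline, and en_core_web_sm is just an example model name (it must be installed for the commented lines to run):

```python
def lemmatize(text, nlp):
    # Join the lemma_ of every token in the processed text.
    return " ".join(tok.lemma_ for tok in nlp(text))

# With a real model (assumes spaCy and en_core_web_sm are installed):
# import spacy
# nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# lemmatize("the cats were running", nlp)
```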

For point 2, in my experience the best approach is to use Spark's mapPartitions method: load the nlp object once inside each partition and run nlp.pipe() over the partition's records. You can prevent each nlp object from spawning many threads by setting the environment variable OPENBLAS_NUM_THREADS=1. This is not perfect, because the model has to be reloaded for each partition, which takes some time, but with large partitions it is still worth it.
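A minimal sketch of that mapPartitions pattern. The function name lemmatize_partition and the loader-injection are mine, not spaCy's or Spark's API; the driver-side lines assume an existing RDD of strings named rdd and an installed en_core_web_sm model:

```python
from functools import partial

def lemmatize_partition(texts, load_nlp):
    # Runs once per Spark partition: pay the model-loading cost a single
    # time, then stream every record in the partition through nlp.pipe().
    nlp = load_nlp()
    for doc in nlp.pipe(texts):
        yield " ".join(tok.lemma_ for tok in doc)

# Driver side (sketch; in practice set OPENBLAS_NUM_THREADS=1 via
# spark.executorEnv so the executors actually see it):
# import spacy
# loader = lambda: spacy.load("en_core_web_sm", disable=["parser", "ner"])
# lemmas = rdd.mapPartitions(partial(lemmatize_partition, load_nlp=loader))
```

Injecting the loader (rather than loading at module import time) keeps the expensive spacy.load call inside the partition, which is the whole point of the pattern.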

@vvivek921
Author

I would have to read up on a few things to better understand your reply to point 2. Let me get back to you.

@arvind-ravichandran

@thomasopsomer Is it possible to share the spaCy model across Python workers within a worker node/executor, thereby reducing its memory footprint?

@honnibal
Member

honnibal commented Aug 22, 2018

@vvivek921 Sorry for the delay replying.

I'll answer your specific questions first, but from your elaboration I also think you're probably approaching this the wrong way.

  1. For just the lemmas, you need the tagger, but not the parser or NER. So the following should work: nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

  2. Try not to do something like spark.map(nlp, lots_of_text), which deserializes the nlp object on each call. Instead, ask Spark to do as little as possible. Something like this:

import functools
import spacy

def write_lemmas(model_path, input_dir, output_dir, filename):
    # Load the model, read the texts, write the output
    nlp = spacy.load(model_path)
    texts = read_texts(input_dir, filename)
    for doc in nlp.pipe(texts):
        write_output(doc, output_dir, filename)

spark.map(functools.partial(write_lemmas, model_path, input_dir, output_dir), filenames)

This just uses Spark to execute your work on lots of workers, but gives you control over how the work is batched up.

An alternative approach is to figure out how to get your workers to pre-load the NLP object. Then you can pipe text to and from them. However, you should usually be able to organise the work so that this isn't really necessary.

Loading the NLP object takes like 30s when you have vectors (we'd like to reduce this). If you can batch up your work so that each chunk takes 20 minutes or more, the loading overhead becomes insignificant.
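The arithmetic behind that rule of thumb, using the numbers above:

```python
load_seconds = 30         # one-off cost of loading the nlp object
chunk_seconds = 20 * 60   # useful work per chunk, as suggested above

# Fraction of each chunk's wall time spent loading the model.
overhead = load_seconds / (load_seconds + chunk_seconds)
print(f"loading is {overhead:.1%} of each chunk's wall time")  # about 2.4%
```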

> For each of the 5 tasks above, I am using spacy with default pipeline. I want to alter the pipeline for each of the tasks above to reduce time taken for each of the tasks.

If this means parsing documents multiple times, you might be worse off doing this. It might be better just to parse the text once, and get what you want from it.
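A sketch of that "parse once" idea: run the full default pipeline a single time and pull all five attributes from the same Doc. The function name extract_all is mine; the commented lines assume an installed en_core_web_sm model:

```python
def extract_all(doc):
    # One parsed Doc serves all five tasks from the question.
    return {
        "sentences":   [sent.text for sent in doc.sents],          # task 1
        "lemmas":      " ".join(tok.lemma_ for tok in doc),        # task 2
        "noun_chunks": [chunk.text for chunk in doc.noun_chunks],  # task 3
        "tags":        [(tok.text, tok.tag_) for tok in doc],      # task 4
        "deps":        [(tok.text, tok.dep_, tok.head.text) for tok in doc],  # task 5
    }

# With a real model:
# import spacy
# nlp = spacy.load("en_core_web_sm")   # full default pipeline, loaded once
# results = extract_all(nlp("First sentence. Second sentence."))
```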

@honnibal honnibal added the usage General spaCy usage label Aug 26, 2018
@lock

lock bot commented Sep 25, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 25, 2018