Need clarity on which component in default pipeline modifies lemma_ on Doc and need suggestions for improving spacy throughput #2678
Any pointers, please?
Hey, for point 2: in my experience the best approach is to use Spark.
I would have to read up on a few things to better understand your reply to point 2. Let me get back to you.
@thomasopsomer Is it possible to share the spaCy model across Python workers within a worker node/executor by any chance, thereby reducing its memory footprint?
@vvivek921 Sorry for the delay replying. I'll answer your specific questions first, but I also think from your elaboration you're probably doing this the wrong way.
```python
def write_lemmas(model_path, input_dir, output_dir, filename):
    # Load model, read text, write output
    nlp = spacy.load(model_path)
    texts = read_texts(input_dir, filename)
    for doc in nlp.pipe(texts):
        write_output(doc, output_dir, filename)

spark.map(functools.partial(write_lemmas, model_path, input_dir, output_dir), filenames)
```

This just uses Spark to execute your work on lots of workers, but gives you control of how the work is batched up. An alternative approach is to figure out how to get your workers to pre-load the NLP object; then you can pipe text to and from them. However, you should usually be able to organise the work so that this isn't really necessary. Loading the NLP object takes around 30s when you have vectors (we'd like to reduce this). If you can batch up your work so that each chunk takes 20 minutes or more, the loading overhead becomes insignificant.
If this means parsing documents multiple times, you might be worse off doing this. It might be better just to parse the text once and get what you want from it.
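The pre-loading idea mentioned above can be sketched as a module-level cache, so that each Python worker process pays the model-loading cost at most once instead of once per task. This is only a sketch: `get_nlp` and the `loader` parameter are illustrative names, not spaCy or Spark API.

```python
# Sketch: cache the loaded model per worker process. On Spark, each Python
# worker that imports this module loads the model at most once, rather than
# once per task. get_nlp and loader are illustrative names, not library API.
_NLP_CACHE = {}

def get_nlp(model_path, loader):
    """Return the model for model_path, loading it on first use in this process."""
    if model_path not in _NLP_CACHE:
        _NLP_CACHE[model_path] = loader(model_path)  # e.g. loader=spacy.load
    return _NLP_CACHE[model_path]
```

Functions called on the workers can then use `get_nlp(model_path, spacy.load)` instead of loading the model themselves.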
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Two questions:

1. I am using `' '.join([word.lemma_ for word in nlp(word_chunk)])` to lemmatise a chunk of text, and I want to reduce the latency of the `nlp()` call. I am considering disabling some components of the pipeline, but the docs don't mention which component sets the `Doc[i].lemma_` attribute, so I don't know which components I can safely discard.
2. We are using PySpark and spaCy to process large amounts of text in parallel, but we are still not happy with the throughput, so I am trying to reduce the latency of the spaCy calls. What else do you suggest? Should I explore using `nlp.pipe` instead of `nlp()` on our PySpark cluster? Or should I consider running PySpark with GPUs, since spaCy is a pipeline of neural-network predictions?
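A minimal sketch of the `nlp.pipe` batching pattern asked about in question 2. To keep the sketch self-contained it uses `spacy.blank("en")`, a tokenizer-only pipeline; in practice you would load a trained model and disable the components you don't need, e.g. `spacy.load("en_core_web_sm", disable=["parser", "ner"])` (the model name is an example, not taken from this thread).

```python
import spacy

# Self-contained sketch: a blank English pipeline (tokenizer only).
# In practice, load a trained model and disable unneeded components,
# e.g. spacy.load("en_core_web_sm", disable=["parser", "ner"]).
nlp = spacy.blank("en")

texts = ["This is one document.", "This is another."]

# nlp.pipe() streams texts through the pipeline in batches, which is
# usually faster than calling nlp(text) once per document.
docs = list(nlp.pipe(texts, batch_size=1000))
```

Batching amortises per-call overhead, which is why `nlp.pipe` over a list of texts generally beats a loop of individual `nlp()` calls.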
Detailed elaboration of point 1 (optional read)
I will start by giving some background on the problem I am trying to solve.
We are using spaCy to process large amounts of text data and extract insights from it.
We are using spaCy with its default pipeline for multiple tasks:
For each of the 5 tasks above, I am using spaCy with the default pipeline. I want to alter the pipeline for each task to reduce the time it takes.
This document neatly summarises which components of the default pipeline create which attributes on the `Doc` object.
My strategy is to retain only the parts of the pipeline that are relevant to the task:
- For tasks 1, 3, and 5, which require `sents`, `noun_chunks`, and `dep`, I am retaining the tagger and parser and discarding the `ner` component.
- For task 4, I am retaining the tagger alone.
- For task 2, which is the most frequently used, I cannot decide which parts of the pipeline to retain, as it is not clearly stated which component creates the `Doc[i].lemma_` attribute.
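The per-task strategy above could be written down as a small table of components to disable. This is a hedged sketch, not a confirmed answer: the component names assume spaCy v2's default pipeline (`tagger`, `parser`, `ner`), `load_for_task` and the model name are hypothetical, and the guess that the tagger is what lemmatization needs is exactly the open question in this issue.

```python
import spacy

# Hypothetical per-task disable lists, following the strategy above.
# Component names assume spaCy v2's default pipeline: tagger, parser, ner.
TASK_DISABLES = {
    "sentences": ["ner"],            # sents/noun_chunks/dep need the parser
    "lemmas": ["parser", "ner"],     # assumption: the tagger feeds lemmatization
    "pos_tags": ["parser", "ner"],   # tagger alone
}

def load_for_task(task, model="en_core_web_sm"):
    """Load a model keeping only the components the task needs.

    "en_core_web_sm" is an example model name, not taken from this thread.
    """
    return spacy.load(model, disable=TASK_DISABLES[task])
```

Centralising the disable lists like this makes it easy to adjust them once the lemma question is settled.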