
wiki_train_entity_linker hanging on step 3 #5131

Closed
kprice-aktors opened this issue Mar 10, 2020 · 8 comments · Fixed by #5159
Labels
feat / nel Feature: Named Entity linking perf / speed Performance: speed

Comments


kprice-aktors commented Mar 10, 2020

This is a follow-up on a previous memory issue.
Following the advice from that thread, and the instructions in the documentation, I installed the most recent version of spaCy and compiled it in a virtual environment. The updated scripts are supposed to address this memory issue. At first it seemed like that was fixed, but it looks like the script is hanging on step 3, when processing the dev data:

(.env) user@server:~$ python3 ../.local/lib/python3.7/site-packages/spaCy/bin/wiki_entity_linking/wikidata_train_entity_linker.py -o models/en knowledge-base/en -t 80000 -d 20000 -l CARDINAL,DATE,MONEY,ORDINAL,QUANTITY,TIME,PERCENT
2020-03-10 12:37:13,099 - INFO - __main__ - Creating Entity Linker with Wikipedia and WikiData
2020-03-10 12:37:13,099 - INFO - __main__ - STEP 1a: Loading model from knowledge-base/en/nlp_kb
2020-03-10 12:37:44,596 - INFO - __main__ - Original NLP pipeline has following pipeline components: ['tagger', 'parser', 'ner']
2020-03-10 12:37:44,597 - INFO - __main__ - STEP 1b: Loading KB from knowledge-base/en/kb
2020-03-10 12:37:57,864 - INFO - __main__ - STEP 2: Reading training & dev dataset from knowledge-base/en/gold_entities.jsonl
2020-03-10 12:46:48,019 - INFO - __main__ - Training set has 6099890 articles, limit set to roughly 80000 articles per epoch
2020-03-10 12:46:48,019 - INFO - __main__ - Dev set has 678808 articles, limit set to roughly 20000 articles for evaluation
2020-03-10 12:46:48,049 - INFO - __main__ - STEP 3: Creating and training an Entity Linking pipe for 10 epochs
2020-03-10 12:46:48,049 - INFO - __main__ - Discarding 7 NER types: ['CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'QUANTITY', 'TIME', 'PERCENT']
2020-03-10 12:46:49,418 - INFO - __main__ - Dev Baseline Accuracies:
Processing dev data:   0%                                | 0/20000 [00:00<?, ?it/s]

I've tried limiting the training data and testing data as well. I don't think it's a matter of time - step 2 took around 10 minutes, but this step has been hanging for about an hour. The closest issue I can find to this seems to be stuck at a different step, so I am not really sure what could be wrong.
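Since step 2 alone took around 10 minutes just to read the file, it can help to gauge the scale of the data yourself before waiting on step 3. A minimal sketch that counts the entries in `gold_entities.jsonl`, assuming it is standard JSON Lines (one JSON object per line); `count_articles` is a hypothetical helper, not part of the spaCy scripts:

```python
import json

def count_articles(jsonl_path):
    """Count entries in a JSON Lines file (one JSON object per line)."""
    n = 0
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # fail fast on a corrupt line
                n += 1
    return n
```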

Info about spaCy

  • Python version: 2.7.17
  • Platform: Linux-5.3.0-40-generic-x86_64-with-Ubuntu-19.10-eoan
  • spaCy version: 2.2.3

EDIT: After running the script again with an even smaller dataset, it looks like it IS a matter of time, so I will be closing this issue for now

@svlandeg
Member

It is definitely counterintuitive if the progress bar stays stuck at 0% for too long. I will look into this!

@svlandeg svlandeg reopened this Mar 10, 2020
@svlandeg svlandeg added feat / nel Feature: Named Entity linking perf / speed Performance: speed labels Mar 10, 2020
@kprice-aktors
Author

A minor update: I left the program running overnight and came back in, and it is still progressing, but very slowly.

[Screenshot (6) attached]

@svlandeg
Member

svlandeg commented Mar 11, 2020

Ok, I had another look and there's still a major inefficiency in the script: the selection of your 80,000 articles out of the pool of 6 million. In short, if you "manually" create a smaller version of your gold_entities.jsonl, it should speed things up significantly.

I was running this on 165,000 training articles (and 1,000 dev), with one epoch taking less than 24 hours, so your training time is definitely too slow. You can try cutting gold_entities.jsonl down as I said. Meanwhile, I will improve the scripts on our end (sometime this week).
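One straightforward way to do that manual cut is to copy the first N lines of the file and point the training script at the smaller copy. A minimal sketch, assuming a line-delimited JSON file; `cut_jsonl` is a hypothetical name, not part of spaCy:

```python
from itertools import islice

def cut_jsonl(src, dst, n_lines):
    """Copy the first n_lines entries of a JSONL file to a smaller file."""
    with open(src, encoding="utf8") as fin, \
         open(dst, "w", encoding="utf8") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
```

This streams the file line by line, so it never loads the full 6-million-article dataset into memory.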

@kprice-aktors
Author

Thanks, cutting the jsonl file into smaller pieces seems to have sped the process up!

@svlandeg
Member

Thanks for letting me know - that confirms my suspicion!

@kprice-aktors
Author

Ok, I was able to successfully train a model for entity linking, but because of the limits I set on the training and test data, it's not actually that effective at performing NEL. What limits on the training and test data would you recommend as the most practical trade-off between model quality and training time?

@svlandeg
Member

I haven't done an exhaustive search, of course, but the setting of 165,000 training articles worked well for me. The dev test set can be kept small, around 1,000 or so.
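One way to build subsets of roughly those sizes without biasing toward whatever articles happen to sit at the top of the file is to sample randomly. A hedged sketch (`sample_split` is a hypothetical helper; it loads the whole file into memory, which is only practical if gold_entities.jsonl fits in RAM):

```python
import random

def sample_split(src, train_dst, dev_dst, n_train=165_000, n_dev=1_000, seed=0):
    """Randomly draw disjoint train and dev subsets from a JSONL file."""
    with open(src, encoding="utf8") as f:
        lines = [line for line in f if line.strip()]
    # Deterministic shuffle so the split is reproducible across runs.
    random.Random(seed).shuffle(lines)
    with open(train_dst, "w", encoding="utf8") as f:
        f.writelines(lines[:n_train])
    with open(dev_dst, "w", encoding="utf8") as f:
        f.writelines(lines[n_train:n_train + n_dev])
```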


lock bot commented May 5, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 5, 2020