
wiki_train_entity_linker hanging on step 3 #5131

Closed
kprice-aktors opened this issue Mar 10, 2020 · 8 comments · Fixed by #5159
Labels
feat / nel Feature: Named Entity linking perf / speed Performance: speed

Comments


kprice-aktors commented Mar 10, 2020

This is a follow-up on a previous memory issue.
Following the advice from that thread, and the instructions in the documentation, I installed the most recent version of spaCy and compiled it in a virtual environment. The updated scripts are supposed to address this memory issue. At first it seemed like that was fixed, but it looks like the script is hanging on step 3, when processing the dev data:

(.env) user@server:~$ python3 ../.local/lib/python3.7/site-packages/spaCy/bin/wiki_entity_linking/wikidata_train_entity_linker.py -o models/en knowledge-base/en -t 80000 -d 20000 -l CARDINAL,DATE,MONEY,ORDINAL,QUANTITY,TIME,PERCENT
2020-03-10 12:37:13,099 - INFO - __main__ - Creating Entity Linker with Wikipedia and WikiData
2020-03-10 12:37:13,099 - INFO - __main__ - STEP 1a: Loading model from knowledge-base/en/nlp_kb
2020-03-10 12:37:44,596 - INFO - __main__ - Original NLP pipeline has following pipeline components: ['tagger', 'parser', 'ner']
2020-03-10 12:37:44,597 - INFO - __main__ - STEP 1b: Loading KB from knowledge-base/en/kb
2020-03-10 12:37:57,864 - INFO - __main__ - STEP 2: Reading training & dev dataset from knowledge-base/en/gold_entities.jsonl
2020-03-10 12:46:48,019 - INFO - __main__ - Training set has 6099890 articles, limit set to roughly 80000 articles per epoch
2020-03-10 12:46:48,019 - INFO - __main__ - Dev set has 678808 articles, limit set to roughly 20000 articles for evaluation
2020-03-10 12:46:48,049 - INFO - __main__ - STEP 3: Creating and training an Entity Linking pipe for 10 epochs
2020-03-10 12:46:48,049 - INFO - __main__ - Discarding 7 NER types: ['CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'QUANTITY', 'TIME', 'PERCENT']
2020-03-10 12:46:49,418 - INFO - __main__ - Dev Baseline Accuracies:
Processing dev data:   0%                                | 0/20000 [00:00<?, ?it/s]

I've tried limiting the training data and testing data as well. I don't think it's a matter of time - step 2 took around 10 minutes, but this step has been hanging for about an hour. The closest issue I can find to this seems to be stuck at a different step, so I am not really sure what could be wrong.
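Since step 2 alone took around 10 minutes just to read the file, it can help to gauge the scale of the data yourself before waiting on step 3. A minimal sketch that counts the entries in `gold_entities.jsonl`, assuming it is standard JSON Lines (one JSON object per line); `count_articles` is a hypothetical helper, not part of the spaCy scripts:

```python
import json

def count_articles(jsonl_path):
    """Count entries in a JSON Lines file (one JSON object per line)."""
    n = 0
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # fail fast on a corrupt line
                n += 1
    return n
```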

Info about spaCy

  • Python version: 2.7.17
  • Platform: Linux-5.3.0-40-generic-x86_64-with-Ubuntu-19.10-eoan
  • spaCy version: 2.2.3

EDIT: After running the script again with an even smaller dataset, it looks like it IS a matter of time, so I will be closing this issue for now

@svlandeg
Member

It is definitely counterintuitive if the progress bar stays stuck at 0% for too long. I will look into this!

@svlandeg svlandeg reopened this Mar 10, 2020
@svlandeg svlandeg added feat / nel Feature: Named Entity linking perf / speed Performance: speed labels Mar 10, 2020
@kprice-aktors
Author

A minor update: I left the program running overnight and came back in, and it is still progressing, but very slowly.

[Screenshot (6) attached]

@svlandeg
Member

svlandeg commented Mar 11, 2020

Ok, I had another look and there's still a major inefficiency in the script: the selection of your 80,000 articles out of the pool of 6 million. In short, if you "manually" create a smaller version of your gold_entities.jsonl, it should speed things up significantly.

I was running this on 165,000 training articles (and 1,000 dev), with one epoch taking less than 24 hours, so your training time is definitely too slow. You can try cutting gold_entities.jsonl down as I said. Meanwhile, I will improve the scripts on our end (sometime this week).
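One straightforward way to do that manual cut is to copy the first N lines of the file and point the training script at the smaller copy. A minimal sketch, assuming a line-delimited JSON file; `cut_jsonl` is a hypothetical name, not part of spaCy:

```python
from itertools import islice

def cut_jsonl(src, dst, n_lines):
    """Copy the first n_lines entries of a JSONL file to a smaller file."""
    with open(src, encoding="utf8") as fin, \
         open(dst, "w", encoding="utf8") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
```

This streams the file line by line, so it never loads the full 6-million-article dataset into memory.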

@kprice-aktors
Author

Thanks, cutting the jsonl file into smaller pieces seems to have sped the process up!

@svlandeg
Member

Thanks for letting me know - that confirms my suspicion!

@kprice-aktors
Author

Ok, I was able to successfully train a model for entity linking, but because of the limits I set on the training and test data, it's not actually that effective at performing NEL. What limits on the training and test data would you recommend as the most practical trade-off between model quality and training time?

@svlandeg
Member

I haven't done an exhaustive search, of course, but the setting of 165,000 training articles worked well for me. The dev test set can be kept small, around 1,000 or so.
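One way to build subsets of roughly those sizes without biasing toward whatever articles happen to sit at the top of the file is to sample randomly. A hedged sketch (`sample_split` is a hypothetical helper; it loads the whole file into memory, which is only practical if gold_entities.jsonl fits in RAM):

```python
import random

def sample_split(src, train_dst, dev_dst, n_train=165_000, n_dev=1_000, seed=0):
    """Randomly draw disjoint train and dev subsets from a JSONL file."""
    with open(src, encoding="utf8") as f:
        lines = [line for line in f if line.strip()]
    # Deterministic shuffle so the split is reproducible across runs.
    random.Random(seed).shuffle(lines)
    with open(train_dst, "w", encoding="utf8") as f:
        f.writelines(lines[:n_train])
    with open(dev_dst, "w", encoding="utf8") as f:
        f.writelines(lines[n_train:n_train + n_dev])
```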


lock bot commented May 5, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 5, 2020