Having an error with Entity Linking #4000

Closed
almoslmi opened this issue Jul 22, 2019 · 6 comments
Labels
enhancement Feature requests and improvements feat / nel Feature: Named Entity linking

Comments

@almoslmi

I ran the NEL example (wikidata_entity_linking.py), and it worked well after I fixed some issues. However, once it had finished processing the Wikipedia and Wikidata dumps, it failed to write the KB and raised an AssertionError. All necessary libraries are installed and added to the environment. Any idea how to solve this issue, please?
By the way, I tried to debug the file that causes the problem (kb.pyx), but that didn't work.

I am using PyCharm. Here are the last lines of the log:

4 1907 0.07248597717285156
4 1908 0.07176494598388672
Trained on 9540645 entities across 5 epochs
Final loss: 0.07176494598388672

  • get entity embeddings 2019-07-20 18:33:18.329715

  • adding 1908129 entities 2019-07-20 19:30:44.652900

  • adding aliases 2019-07-20 19:31:01.172724

kb size: 1908129 1908129 2830524
done with kb 2019-07-20 19:32:18.446255
kb entities: 1908129
kb aliases: 2830524

STEP 3b: write KB and NLP 2019-07-20 19:32:36.823641
Traceback (most recent call last):
  File "C:/Users/tareq/PycharmProjects/spaCy/examples/pipeline/wikidata_entity_linking.py", line 441, in <module>
    run_pipeline()
  File "C:/Users/tareq/PycharmProjects/spaCy/examples/pipeline/wikidata_entity_linking.py", line 111, in run_pipeline
    kb_1.dump(KB_FILE)
  File "kb.pyx", line 208, in spacy.kb.KnowledgeBase.dump
  File "kb.pyx", line 346, in spacy.kb.Writer.__init__
AssertionError

Process finished with exit code 1

========================

Later, I tried to debug the file that causes the error (kb.pyx) and got this error:

C:\Users\tareq\Anaconda3\python.exe "C:\Program Files\JetBrains\PyCharm 2019.1.3\helpers\pydev\pydevconsole.py" --mode=client --port=52781
import sys; print('Python %s on %s' % (sys.version, sys.platform))
sys.path.extend(['C:\Users\tareq\PycharmProjects\spaCy', 'C:\Users\tareq\PycharmProjects\spaCy\spacy', 'C:/Users/tareq/PycharmProjects/spaCy'])
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.6.1 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.6.1
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] on win32
runfile('C:/Users/tareq/PycharmProjects/spaCy/spacy/kb.pyx', wdir='C:/Users/tareq/PycharmProjects/spaCy/spacy')
Traceback (most recent call last):
  File "C:\Users\tareq\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    runfile('C:/Users/tareq/PycharmProjects/spaCy/spacy/kb.pyx', wdir='C:/Users/tareq/PycharmProjects/spaCy/spacy')
  File "C:\Program Files\JetBrains\PyCharm 2019.1.3\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2019.1.3\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/tareq/PycharmProjects/spaCy/spacy/kb.pyx", line 7
    from cymem.cymem cimport Pool
                           ^
SyntaxError: invalid syntax

==============
Environment
spaCy version: 2.1.6
Platform: Windows-10-10.0.17763-SP0
Python version: 3.7.1

@svlandeg svlandeg added the feat / nel Feature: Named Entity linking label Jul 22, 2019
@svlandeg
Member

Could it be that the path you specified for storing the KB doesn't exist? I think the Writer currently doesn't create the necessary directories if they don't exist. I can make that change, but in the meantime, could you make sure the path exists on your end and check whether it runs properly then?
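
As a stopgap, the directory that will hold the KB file can be created before the dump call. This is a minimal sketch, assuming the KB location is an ordinary file path; `ensure_parent_dir` is a hypothetical helper and not part of spaCy:

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the directory that will contain file_path, if it is missing."""
    path = Path(file_path)
    # parents=True also creates intermediate directories;
    # exist_ok=True makes the call a no-op when the directory already exists.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Hypothetical usage in the example script, right before writing the KB:
# ensure_parent_dir(KB_FILE)
# kb_1.dump(KB_FILE)
```

The helper only creates the parent directory; the file itself is still written by the dump call.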

@svlandeg
Member

If you want to experiment faster, you can change the call to read_wikidata_entities_json by setting limit=10000 or so, to process only the first 10K lines of the Wikidata dump. That should make the processing much faster (and result in a much smaller KB, of course).
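
To illustrate what such a limit does, here is a generic sketch of a line reader that stops early. It is not the actual read_wikidata_entities_json implementation; `read_first_lines` is a hypothetical stand-in, and it assumes a plain UTF-8 text file rather than the compressed dump:

```python
import itertools


def read_first_lines(path, limit=None):
    """Yield at most `limit` lines from a text file (all lines if limit is None)."""
    with open(path, encoding="utf8") as f:
        # islice stops after `limit` items; with limit=None it passes every line through.
        for line in itertools.islice(f, limit):
            yield line.rstrip("\n")
```

Because the reader is lazy, the rest of the file is never read once the limit is reached, which is where the speed-up comes from.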

@almoslmi
Author

Thanks @svlandeg for your kind and prompt reply. I really appreciate your effort.

Actually, I had noticed that it doesn't create the necessary directories, so creating them was the first thing I did. In practice, I had no problem with Wikipedia, and all the CSV files were created. The problem appears when it completes the Wikidata processing, as you can see in the log. Thanks for the tip about limiting the number of processed lines. It will save me time and let me debug faster, but could you please explain where and how I can change it?
By the way, when I open any .pyx or .pxd file in PyCharm, it doesn't recognise some related libraries such as cymem and preshed, so they appear as unknown, although they are installed.

@svlandeg
Member

svlandeg commented Jul 22, 2019

You can change the call to read_wikidata_entities_json in kb_creator.py.

Actually, if all your files have been preprocessed as you said, and from the logs it looks like that went fine, you can take an even better shortcut in kb_creator.py: set read_raw_data to False, and it will read the CSV files directly instead of recreating them.

UPDATE: you can also set MIN_ENTITY_FREQ = 2000 in wikidata_entity_linking.py to get a much smaller KB.
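
As a rough illustration of why a higher threshold shrinks the KB, here is a sketch of a frequency filter. It assumes entities are kept when their count is at least the threshold; the exact comparison used in kb_creator.py is not shown in this thread. The function name and the counts are made up for the example:

```python
def filter_by_frequency(entity_freqs, min_freq):
    """Keep only entities seen at least `min_freq` times (assumed comparison)."""
    return {entity: freq for entity, freq in entity_freqs.items() if freq >= min_freq}


# Made-up counts, for illustration only.
counts = {"Q42": 3500, "Q1": 12000, "Q999": 40}
small_kb_entities = filter_by_frequency(counts, min_freq=2000)
```

With these illustrative counts, only the two frequent entities remain, so fewer entities (and aliases) end up in the KB.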

In PyCharm, I often get these warnings and errors about unknown imports, as well. You can probably just ignore these.

@svlandeg svlandeg self-assigned this Jul 22, 2019
@svlandeg svlandeg added the enhancement Feature requests and improvements label Jul 22, 2019
svlandeg added a commit to svlandeg/spaCy that referenced this issue Jul 22, 2019
@almoslmi
Author

almoslmi commented Jul 22, 2019

Many thanks again, @svlandeg, for your help. The tips worked well, and I managed to run the program again. This time I got another error while processing some articles, like the sample shown below. The number of affected articles is fixed, but the program doesn't stop after reporting the error: it keeps printing the same errors for the same articles, as if it were in a loop, and never stops unless I stop it manually.

====================
STEP 5: create training dataset 2019-07-23 01:41:01.119395
2019-07-23 01:41:01.136349 processed 0 lines of Wikipedia dump
Error processing article 12 Anarchism unsupported operand type(s) for +: 'WindowsPath' and 'str'
Error processing article 25 Autism unsupported operand type(s) for +: 'WindowsPath' and 'str'
Error processing article 39 Albedo unsupported operand type(s) for +: 'WindowsPath' and 'str'
Error processing article 290 A unsupported operand type(s) for +: 'WindowsPath' and 'str'
Error processing article 303 Alabama unsupported operand type(s) for +: 'WindowsPath' and 'str'
Error processing article 305 Achilles unsupported operand type(s) for +: 'WindowsPath' and 'str'
Error processing article 307 Abraham Lincoln unsupported operand type(s) for +: 'WindowsPath' and 'str'
Error processing article 308 Aristotle unsupported operand type(s) for +: 'WindowsPath' and 'str'
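
The message in this log is what Python raises when a pathlib path is combined with a string using `+`. The sketch below reproduces that class of error and shows two common ways around it. Which line of the example script triggers it is not shown in this thread, so the directory and file names here are hypothetical; `PureWindowsPath` is used so the snippet behaves the same on any platform (the class name in the message then reads 'PureWindowsPath' rather than 'WindowsPath'):

```python
from pathlib import PureWindowsPath

base_dir = PureWindowsPath("C:/data/wiki")  # hypothetical directory

# "+" is not defined between a path object and a str, so this raises TypeError.
try:
    base_dir + "/article.txt"
    error_message = None
except TypeError as err:
    error_message = str(err)

# Option 1: join with "/" and keep a path object.
joined = base_dir / "article.txt"

# Option 2: convert to str first when plain string concatenation is really wanted.
concatenated = str(base_dir) + "_backup"
```

Option 1 is usually preferable because the result stays a path object and the separator is handled by pathlib.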

polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019
@lock

lock bot commented Aug 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 22, 2019