Entity linking using Wikipedia & Wikidata #3864
Conversation
Thanks for the detailed reviews! I'm working on addressing them. Additionally, I just found that …
Update @ines @honnibal: I've addressed most comments - feel free to review again. It still needs to be adjusted & tested on GPU though. I'm happy to look into that, though it will take me a bit longer than if you do it, Matt - but I don't mind either way. The main thing that changed since the publication of this PR: originally, the model looked at the cosine similarity between the context (sentence) and entity (description) encoders, but now there's an additional EL model that takes the output of the context & entity encoders plus additional features such as prior probability and NER types, and runs a net on top of that. This is a more generic setup and will allow for more flexibility, but it's a little awkward how that currently results in two models being trained in the …
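The setup described in that comment - a net on top of the encoder outputs plus prior probability and NER-type features - can be sketched in plain Python. All names, dimensions, weights, and the single linear layer below are illustrative assumptions, not the PR's actual implementation:

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def el_score(context_vec, entity_vec, prior_prob, ner_type_match, weights, bias):
    # Hypothetical combined feature vector: encoder similarity + extra features.
    features = [cosine(context_vec, entity_vec), prior_prob, ner_type_match]
    # A one-layer net with a sigmoid stands in for the "net on top".
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# toy 2D vectors instead of the 64D encoder outputs
score = el_score([0.1, 0.9], [0.2, 0.8], prior_prob=0.75,
                 ner_type_match=1.0, weights=[2.0, 1.0, 0.5], bias=-1.0)
```

The point of the extra layer over plain cosine similarity is that the prior probability and NER-type agreement can override a weak context match, which matches the "more generic setup" motivation above.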
Okay, I'll go ahead and merge this 🎉 It'll still be unofficial and undocumented, but it'd allow people who want to test it to do so. We can then keep adding to it until it's ready for "official" release 🙂
I have successfully trained a spaCy entity linking model (obviously by limiting the data). My question is: how do I display the description of an entity from the KB as output?

import spacy
Hi @zainbnv. The descriptions are currently not stored in the KB itself for performance reasons. However, from the intermediary results during processing, you should have a file …
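Since the KB only stores IDs and vectors, one way to show descriptions is to load that intermediary file into a dict and look entities up by their KB ID. The file name and delimiter below are assumptions for illustration; the actual layout produced by the pipeline may differ:

```python
import csv
import io

# Stand-in for the intermediary descriptions file; in practice you would
# open the real file produced during KB creation instead of this StringIO.
desc_file = io.StringIO(
    "WD_id|description\n"
    "Q42|British author and humorist\n"
    "Q5284|American business magnate\n"
)

descriptions = {}
reader = csv.reader(desc_file, delimiter="|")
next(reader)  # skip the header row
for wd_id, description in reader:
    descriptions[wd_id] = description

# After linking, each entity exposes its KB ID (in spaCy: ent.kb_id_),
# which serves as the lookup key here.
print(descriptions.get("Q42"))  # -> British author and humorist
```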
Implementation of the Entity Linker (cf. issue #3339) using Wikidata entities and Wikipedia training. The focus of this PR is on the general pipeline - further performance improvements can certainly be made.
Note: this PR temporarily reverts this edit as it broke the parsing by `en_core_web_lg`.

Description

- Added core functionality
- `links` property with (start_char, end_char, gold_kb_id) tuples

Types of change

New feature
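The (start_char, end_char, gold_kb_id) format of the `links` property can be illustrated with a minimal, made-up training example (text and QID chosen for illustration only):

```python
# Hypothetical gold annotation in the (start_char, end_char, gold_kb_id) format
text = "Douglas Adams wrote The Hitchhiker's Guide to the Galaxy."
links = [(0, 13, "Q42")]  # "Douglas Adams" -> Wikidata Q42

for start, end, kb_id in links:
    mention = text[start:end]
    print(mention, "->", kb_id)  # Douglas Adams -> Q42
```

The offsets are character-based, so they can be checked directly against the raw text before training.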
General flow & implementation with Wikipedia & WikiData

- create KB: vocab + entities (freq, entity vector) + aliases (prior prob)
- obtain training data: list of (`Doc`, `GoldParse`) tuples; `GoldParse.links` contains (start_char, end_char, gold_entity) tuples
- run `en_core_web_lg` NER and, for each entity in `ents`, store the sentence as local context
- run training of the pipe similar to training NER or textcat: the encoder runs each `Doc` through a CNN and outputs 64D vectors that are compared to the entity vectors
- link the `ents` to their predicted entity IDs

File overview
in `examples/pipeline/`:

- `wikidata_entity_linking.py`: shows the full pipeline for running the Entity Linking functionality

in `/bin`:

- `wikidata_processor.py`: parse the JSON Wikidata dump to get all entities (takes about 7h to parse 55M lines)
- `wikipedia_processor.py`: process the XML Wikipedia dump to calculate entity frequencies and prior probabilities (takes about 2h to parse 1100M lines)
- `training_set_creator.py`: process the XML Wikipedia dump to parse out raw text + gold-standard entities in offset format per article (takes about 15h to parse 60M articles on 1100M lines)
- `train_descriptions.py`: train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D)
- `kb_creator.py`: pipeline to create the knowledge base from Wikidata entries

in `tests/`:

- `pipeline/test_entity_linker.py`: test correct & incorrect usage of the KB
- `serialize/test_serialize_kb.py`: test correct writing & reading of the KB

First results
With 1.1M entities in the KB (14% of all), "oracle" KB accuracy is 84.2% on a dev set of 5K entities, and prior probability by itself gives 78.2% accuracy. Adding sentence context improves this marginally, but consistently. Training on 200K entities for 2h, the context encoder (by itself, but limited by the KB) gets to 73.9% accuracy and brings the combined accuracy with the prior probabilities to 79.0%. The "pick one at random" baseline on the other hand would only achieve around 54%, so the context encoder is doing relatively well on its own.
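The two baselines compared above are easy to state precisely: the prior-probability baseline always picks an alias's most frequent sense, while the random baseline samples uniformly from the candidates. A minimal sketch, with a made-up candidate list and prior probabilities:

```python
import random

# Hypothetical candidate set for one alias: (Wikidata ID, prior probability)
candidates = [("Q42", 0.80), ("Q5284", 0.15), ("Q123", 0.05)]

def prior_prob_pick(cands):
    # prior-probability baseline: always take the most frequent sense
    return max(cands, key=lambda c: c[1])[0]

def random_pick(cands):
    # "pick one at random" baseline
    return random.choice(cands)[0]

best = prior_prob_pick(candidates)  # -> "Q42"
```

Because most aliases have one dominant sense, the prior baseline is already strong; the context encoder has to beat it by resolving the genuinely ambiguous mentions.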
I experimented with more complex models than the ones in this PR, including e.g. an additional document encoder, but I figured that would bias the results too much towards the typical structure of Wikipedia documents with 2-3 introductory sentences. And preliminary tests with other models didn't improve the accuracy.
Beyond this PR
There are multiple ways of building further on top of this result:
Example output
In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, Douglas reminds us to always bring our towel, even in China or Brazil. The main character in Doug's novel is the man Arthur Dent, but Douglas doesn't write about George Washington or Homer Simpson.
Checklist