
Entity linking using Wikipedia & Wikidata #3864

Merged
109 commits merged into explosion:master on Jul 10, 2019

Conversation


@svlandeg commented Jun 19, 2019

Implementation of the Entity Linker (cf. issue #3339) using Wikidata entities and Wikipedia training. The focus of this PR is the general pipeline; further performance improvements can certainly be made.

Note: this PR temporarily reverts this edit as it broke the parsing by en_core_web_lg.

Description

Added core functionality

  • KB stores entity vectors for each entity
  • KB to/from bytes + unit test
  • various improvements to KB code (nogil)
  • implementation of Entity linking pipe, using entity encoder and context encoder
  • Entity linker pipe to/from file (as part of writing nlp object)
  • GoldParse has links property with (start_char, end_char, gold_kb_id) tuples

Types of change

New feature

General flow & implementation with Wikipedia & Wikidata

  • create KB: vocab + entities (freq, entity vector) + aliases (prior prob)

    • prior probabilities from intrawiki links on Wikipedia
    • total of 7.9M entities/titles filtered down to 1.1M by min. 20 incoming intrawiki links for an entity
    • trained entity encodings with simple encoder-decoder self-supervision to 64D vectors
    • alias defined by max. 10 candidates per entity and min. 5 occurrences of an alias-entity pair: 1.6M aliases
    • the result is a 345MB KB
  • obtain training data: list of (Doc, GoldParse) tuples, GoldParse.links contains (start_char, end_char, gold_entity) tuples

    • align intrawiki links with en_core_web_lg NER ents, store the sentence as local context
    • some custom filtering to keep memory usage down: articles below 30K chars and sentences between 5 and 100 tokens
  • run training of the pipe similar to training NER or textcat:

    • the context encoder runs a Doc through a CNN and outputs 64D vectors that are compared to the entity vectors
    • training loss is based on the cosine similarity between context and entity vectors; the gradient is its partial derivative w.r.t. the context encoding (currently only positive examples)
    • Dev performance is measured as accuracy, comparing the gold IDs for the NER ents to their predicted entity IDs
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):
    ...
    nlp.update(...)
    ...
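The training signal described above can be sketched in plain Python. This is illustrative only, not the PR's actual Thinc implementation: the loss is one minus the cosine similarity between the context vector and the gold entity vector, and its gradient with respect to the context encoding drives the update.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def loss_and_grad(context_vec, entity_vec):
    """Loss = 1 - cos(context, entity); gradient w.r.t. the context vector."""
    sim = cosine(context_vec, entity_vec)
    norm_c = math.sqrt(sum(a * a for a in context_vec))
    norm_e = math.sqrt(sum(b * b for b in entity_vec))
    # d/dc_i of cos(c, e) = e_i / (|c||e|) - cos(c, e) * c_i / |c|^2
    grad = [
        -(e / (norm_c * norm_e) - sim * c / (norm_c ** 2))
        for c, e in zip(context_vec, entity_vec)
    ]
    return 1.0 - sim, grad

loss, grad = loss_and_grad([1.0, 0.0], [0.0, 1.0])
print(round(loss, 3))  # orthogonal vectors: loss = 1.0
```

With identical vectors the loss and gradient both go to zero, which is the behaviour a positive-examples-only objective relies on.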

File overview

  • in examples/pipeline/ :

    • wikidata_entity_linking.py : shows the full pipeline for running the Entity Linking functionality
  • in /bin:

    • wikidata_processor.py: Parse the JSON wiki data to get all entities (takes about 7h to parse 55M lines)
    • wikipedia_processor.py: Process the XML Wikipedia dump to calculate entity frequencies and prior probabilities (takes about 2h to parse 1100M lines)
    • training_set_creator.py: Process the XML Wikipedia dump to parse out raw text + gold-standard entities in offset format per article (takes about 15h to parse 60M articles on 1100M lines)
    • train_descriptions.py: Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D).
    • kb_creator.py : Pipeline to create the knowledge base from Wikidata entries
  • in tests/ :

    • pipeline/test_entity_linker.py: test correct & incorrect usage of the KB
    • serialize/test_serialize_kb.py: test correct writing & reading of KB

First results

With 1.1M entities in the KB (14% of all), "oracle" KB accuracy is 84.2% on a dev set of 5K entities, and prior probability by itself gives 78.2% accuracy. Adding sentence context improves this marginally, but consistently. Training on 200K entities for 2h, the context encoder (by itself, but limited by the KB) gets to 73.9% accuracy and brings the combined accuracy with the prior probabilities to 79.0%. The "pick one at random" baseline on the other hand would only achieve around 54%, so the context encoder is doing relatively well on its own.
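As an illustration of how a prior probability and a context-similarity score can be combined into a single ranking (the linear weighting and the numbers below are made up for the example, not what this PR implements):

```python
def rank_candidates(candidates, alpha=0.5):
    """Rank candidates by a weighted mix of prior probability and context fit.

    candidates: list of (entity_id, prior_prob, context_sim) tuples.
    alpha: weight on the prior; (1 - alpha) goes to the context similarity.
    """
    scored = [
        (alpha * prior + (1 - alpha) * sim, ent_id)
        for ent_id, prior, sim in candidates
    ]
    scored.sort(reverse=True)
    return [ent_id for _, ent_id in scored]

candidates = [
    ("Q42", 0.80, 0.30),      # high prior, weaker context fit
    ("Q156220", 0.15, 0.90),  # low prior, strong context fit
]
print(rank_candidates(candidates)[0])  # prior dominates at alpha=0.5: "Q42"
```

Lowering `alpha` lets the context encoder override a strong prior, which is the kind of trade-off behind the 78.2% (prior only) vs. 79.0% (combined) numbers above.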

I experimented with more complex models than the ones in this PR, including e.g. an additional document encoder, but I figured that would bias the results too much towards the typical structure of Wikipedia documents with 2-3 introductory sentences. And preliminary tests with other models didn't improve the accuracy.

Beyond this PR

There are multiple ways of building further on top of this result:

  1. bring in more structured data from Wikidata (e.g. whether entities have a date of birth, and what it is)
  2. implement a compositional model for entity vectors
  3. refine the context encoder (currently simply the whole sentence and a CNN)
  4. bring in document consistency either through coreference resolution or unification of the set of predictions per document
  5. implement a document encoder using the other entities in a document
  6. set a threshold for NIL predictions. Currently not done because there are no gold "false" examples in the data, because WP annotation is incomplete
  7. learn (also) from incorrect examples by generating wrong candidates from the KB (preliminary results were not so encouraging)
  8. add in other information from the NLP pipeline such as NER type, POS, etc.
  9. further experiment with hyperparameters of KB construction & model implementation and/or other models
  10. k-best NER
  11. fuzzy matching for candidate generation
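Point 6 (a NIL threshold) could look something like this minimal sketch; the function name and threshold value are illustrative, not part of the PR:

```python
def predict(scored_candidates, threshold=0.4):
    """Return the best-scoring entity ID, or "NIL" if nothing clears the cutoff.

    scored_candidates: list of (entity_id, score) tuples.
    """
    if not scored_candidates:
        return "NIL"
    best_id, best_score = max(scored_candidates, key=lambda x: x[1])
    return best_id if best_score >= threshold else "NIL"

print(predict([("Q42", 0.9), ("Q156220", 0.3)]))  # "Q42"
print(predict([("Q42", 0.2)]))                    # "NIL"
```

As the PR notes, picking the threshold is the hard part: Wikipedia annotation is incomplete, so there are no gold "false" examples to calibrate it against.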

Example output

for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, Douglas reminds us to always bring our towel, even in China or Brazil. The main character in Doug's novel is the man Arthur Dent, but Douglas doesn't write about George Washington or Homer Simpson.

The Hitchhiker's Guide WORK_OF_ART
Douglas Adams PERSON Q42
Douglas PERSON Q156220
China GPE Q148
Brazil GPE Q155
Doug PERSON Q1251705
Arthur Dent PERSON Q613901
Douglas PERSON Q156220
George Washington PERSON Q23
Homer Simpson PERSON Q7810

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg

Thanks for the detailed reviews! I'm working on addressing them. Additionally, I just found that Span.as_doc() is not preserving the entity links, because those aren't properly defined as an attribute yet.
I guess I need to add an ENT_KB_ID or some such to attrs.pyx that links to Token.ent_kb_id?

@svlandeg commented Jul 4, 2019

Update @ines @honnibal: I've addressed most comments - feel free to review again. It still needs to be adjusted & tested on GPU though. I'm happy to look into that, though it will take me a bit longer than it would take you, Matt, but I don't mind either way.

The main thing that changed since the publication of this PR: originally, the model compared cosine similarity between the context (sentence) and entity (description) encoders, but now there's an additional EL model that takes the output of the context & entity encoders plus additional features such as prior probability and NER type, and runs a net on top of that. This is a more generic setup and allows for more flexibility, but it's a little awkward that it currently results in two models being trained in the EntityLinker pipe: the context encoder (Tok2Vec) and the actual EL model. I wasn't sure how to define the optimizer for the context encoder, so right now it's a field sgd_context in EntityLinker. Ideally I would have chained the context encoder into the EL model, which would also make the backprop easier, but I'm not sure how to do that here because its output first needs to be concatenated with an encoder that uses a different input.
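As a rough sketch of that setup (not the actual spaCy/Thinc code; all names here are illustrative): the EL model scores the concatenation of the context encoding, the entity encoding, and scalar features, with a single linear layer standing in for the net on top.

```python
def el_features(context_vec, entity_vec, prior_prob, ner_type_match):
    # Concatenate everything into one input vector for the EL model:
    # context encoding + entity encoding + scalar features.
    return list(context_vec) + list(entity_vec) + [prior_prob, float(ner_type_match)]

def score(features, weights, bias=0.0):
    # A single linear layer as a stand-in for the EL net.
    return sum(f * w for f, w in zip(features, weights)) + bias

feats = el_features([0.1, 0.2], [0.3, 0.4], prior_prob=0.8, ner_type_match=True)
print(len(feats))  # 2 + 2 + 2 = 6
```

The awkwardness described above comes from the left half of this concatenation (the context encoding) having its own trainable model with a different input than the rest.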

@ines commented Jul 10, 2019

Okay, I'll go ahead and merge this 🎉 It'll still be unofficial and undocumented, but it'd allow people who want to test it to do so. We can then keep adding to it until it's ready for "official" release 🙂

@ines ines merged commit 6ba5ddb into explosion:master Jul 10, 2019
@svlandeg svlandeg deleted the feature/nel-wiki branch July 10, 2019 15:34
@svlandeg svlandeg mentioned this pull request Aug 1, 2019
@zainbnv commented Dec 16, 2019

I have successfully trained a spaCy entity linking model (by limiting the data, obviously).

My question is: how do I display the description of an entity from the KB as output?

import spacy
nlp = spacy.load(r"D:\el model\nlp")
doc = nlp("Amir Khan is a great boxer")
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)

@svlandeg

Hi @zainbnv. The descriptions are currently not stored in the KB itself for performance reasons. However, among the intermediate results of processing, you should have a file entity_descriptions.csv that maps each Wikidata ID to its description in a simple tabular format.
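If that intermediate file is available, a lookup could be sketched like this. The delimiter and header used below are assumptions for the example (an in-memory sample stands in for the real file); check the actual output of your processing run.

```python
import csv
import io

# In-memory stand-in for entity_descriptions.csv; the "|" delimiter and
# the header names here are assumptions, not a documented format.
sample = "WD_id|description\nQ42|English author and humorist\n"

descriptions = {}
reader = csv.reader(io.StringIO(sample), delimiter="|")
next(reader)  # skip the header row
for wd_id, desc in reader:
    descriptions[wd_id] = desc

print(descriptions["Q42"])  # English author and humorist
```

With such a dict in hand, the loop over `doc.ents` could print `descriptions.get(e.kb_id_, "")` alongside the text and label.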

Labels: enhancement (Feature requests and improvements), feat / nel (Feature: Named Entity linking), feat / ner (Feature: Named Entity Recognizer)