
Entity linking using Wikipedia & Wikidata #3864

Merged
109 commits merged into explosion:master on Jul 10, 2019

Conversation


@svlandeg commented Jun 19, 2019

Implementation of the Entity Linker (cf. issue #3339) using Wikidata entities and Wikipedia training. The focus of this PR is the general pipeline; further performance improvements can certainly be made.

Note: this PR temporarily reverts this edit as it broke the parsing by en_core_web_lg.

Description

Added core functionality

  • KB stores entity vectors for each entity
  • KB to/from bytes + unit test
  • various improvements to KB code (nogil)
  • implementation of Entity linking pipe, using entity encoder and context encoder
  • Entity linker pipe to/from file (as part of writing nlp object)
  • GoldParse has links property with (start_char, end_char, gold_kb_id) tuples

Types of change

New feature

General flow & implementation with Wikipedia & Wikidata

  • create KB: vocab + entities (freq, entity vector) + aliases (prior prob)

    • prior probabilities from intrawiki links on Wikipedia
    • total of 7.9M entities/titles filtered down to 1.1M by min. 20 incoming intrawiki links for an entity
    • trained entity encodings with simple encoder-decoder self-supervision to 64D vectors
    • alias defined by max. 10 candidates per entity and min. 5 occurrences of an alias-entity pair: 1.6M aliases
    • the result is a 345MB KB
  • obtain training data: list of (Doc, GoldParse) tuples, GoldParse.links contains (start_char, end_char, gold_entity) tuples

    • align intrawiki links with en_core_web_lg NER ents, store the sentence as local context
    • some custom filtering to keep memory usage down: articles below 30K chars and sentences between 5 and 100 tokens
  • run training of the pipe similar to training NER or textcat:

    • the context encoder runs a Doc through a CNN and outputs 64D vectors that are compared to the entity vectors
    • training loss is based on the cosine similarity between context and entity vectors; the gradient is its partial derivative w.r.t. the context encoding (currently only positive examples)
    • Dev performance is measured as accuracy, comparing the gold IDs for the NER ents to their predicted entity IDs
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):
    ...
    nlp.update(...)
    ...
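The training signal described above can be sketched in plain Python. This is illustrative only, not the PR's actual Thinc implementation: the loss is one minus the cosine similarity between the context vector and the gold entity vector, and its gradient with respect to the context encoding drives the update.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def loss_and_grad(context_vec, entity_vec):
    """Loss = 1 - cos(context, entity); gradient w.r.t. the context vector."""
    sim = cosine(context_vec, entity_vec)
    norm_c = math.sqrt(sum(a * a for a in context_vec))
    norm_e = math.sqrt(sum(b * b for b in entity_vec))
    # d/dc_i of cos(c, e) = e_i / (|c||e|) - cos(c, e) * c_i / |c|^2
    grad = [
        -(e / (norm_c * norm_e) - sim * c / (norm_c ** 2))
        for c, e in zip(context_vec, entity_vec)
    ]
    return 1.0 - sim, grad

loss, grad = loss_and_grad([1.0, 0.0], [0.0, 1.0])
print(round(loss, 3))  # orthogonal vectors: loss = 1.0
```

With identical vectors the loss and gradient both go to zero, which is the behaviour a positive-examples-only objective relies on.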

File overview

  • in examples/pipeline/ :

    • wikidata_entity_linking.py : shows the full pipeline for running the Entity Linking functionality
  • in /bin:

    • wikidata_processor.py: Parse the JSON wiki data to get all entities (takes about 7h to parse 55M lines)
    • wikipedia_processor.py: Process the XML Wikipedia dump to calculate entity frequencies and prior probabilities (takes about 2h to parse 1100M lines)
    • training_set_creator.py: Process the XML Wikipedia dump to parse out raw text + gold-standard entities in offset format per article (takes about 15h to parse 60M articles on 1100M lines)
    • train_descriptions.py: Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D).
    • kb_creator.py : Pipeline to create the knowledge base from Wikidata entries
  • in tests/ :

    • pipeline/test_entity_linker.py: test correct & incorrect usage of the KB
    • serialize/test_serialize_kb.py: test correct writing & reading of KB

First results

With 1.1M entities in the KB (14% of all), "oracle" KB accuracy is 84.2% on a dev set of 5K entities, and prior probability by itself gives 78.2% accuracy. Adding sentence context improves this marginally, but consistently. Training on 200K entities for 2h, the context encoder (by itself, but limited by the KB) gets to 73.9% accuracy and brings the combined accuracy with the prior probabilities to 79.0%. The "pick one at random" baseline on the other hand would only achieve around 54%, so the context encoder is doing relatively well on its own.
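As an illustration of how a prior probability and a context-similarity score can be combined into a single ranking (the linear weighting and the numbers below are made up for the example, not what this PR implements):

```python
def rank_candidates(candidates, alpha=0.5):
    """Rank candidates by a weighted mix of prior probability and context fit.

    candidates: list of (entity_id, prior_prob, context_sim) tuples.
    alpha: weight on the prior; (1 - alpha) goes to the context similarity.
    """
    scored = [
        (alpha * prior + (1 - alpha) * sim, ent_id)
        for ent_id, prior, sim in candidates
    ]
    scored.sort(reverse=True)
    return [ent_id for _, ent_id in scored]

candidates = [
    ("Q42", 0.80, 0.30),      # high prior, weaker context fit
    ("Q156220", 0.15, 0.90),  # low prior, strong context fit
]
print(rank_candidates(candidates)[0])  # prior dominates at alpha=0.5: "Q42"
```

Lowering `alpha` lets the context encoder override a strong prior, which is the kind of trade-off behind the 78.2% (prior only) vs. 79.0% (combined) numbers above.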

I experimented with more complex models than the ones in this PR, including e.g. an additional document encoder, but I figured that would bias the results too much towards the typical structure of Wikipedia documents with 2-3 introductory sentences. And preliminary tests with other models didn't improve the accuracy.

Beyond this PR

There are multiple ways of building further on top of this result:

  1. bring in more structured data from Wikidata (e.g. whether entities have a date of birth, and what it is)
  2. implement a compositional model for entity vectors
  3. refine the context encoder (currently simply the whole sentence and a CNN)
  4. bring in document consistency either through coreference resolution or unification of the set of predictions per document
  5. implement a document encoder using the other entities in a document
  6. set a threshold for NIL predictions. Currently not done because there are no gold "false" examples in the data, because WP annotation is incomplete
  7. learn (also) from incorrect examples by generating wrong candidates from the KB (preliminary results were not so encouraging)
  8. add in other information from the NLP pipeline such as NER type, POS, etc.
  9. further experiment with hyperparameters of KB construction & model implementation and/or other models
  10. k-best NER
  11. fuzzy matching for candidate generation
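Point 6 (a NIL threshold) could look something like this minimal sketch; the function name and threshold value are illustrative, not part of the PR:

```python
def predict(scored_candidates, threshold=0.4):
    """Return the best-scoring entity ID, or "NIL" if nothing clears the cutoff.

    scored_candidates: list of (entity_id, score) tuples.
    """
    if not scored_candidates:
        return "NIL"
    best_id, best_score = max(scored_candidates, key=lambda x: x[1])
    return best_id if best_score >= threshold else "NIL"

print(predict([("Q42", 0.9), ("Q156220", 0.3)]))  # "Q42"
print(predict([("Q42", 0.2)]))                    # "NIL"
```

As the PR notes, picking the threshold is the hard part: Wikipedia annotation is incomplete, so there are no gold "false" examples to calibrate it against.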

Example output

for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, Douglas reminds us to always bring our towel, even in China or Brazil. The main character in Doug's novel is the man Arthur Dent, but Douglas doesn't write about George Washington or Homer Simpson.

The Hitchhiker's Guide WORK_OF_ART
Douglas Adams PERSON Q42
Douglas PERSON Q156220
China GPE Q148
Brazil GPE Q155
Doug PERSON Q1251705
Arthur Dent PERSON Q613901
Douglas PERSON Q156220
George Washington PERSON Q23
Homer Simpson PERSON Q7810

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg

Thanks for the detailed reviews! I'm working on addressing them. Additionally, I just found that Span.as_doc() is not preserving the entity links, because those aren't properly defined as an attribute yet.
I guess I need to add an ENT_KB_ID or some such to attrs.pyx that links to Token.ent_kb_id?

@svlandeg commented Jul 4, 2019

Update @ines @honnibal: I've addressed most comments - feel free to review again. It still needs to be adjusted & tested on GPU though. I'm happy to look into that, though it will take me a bit longer than it would take you, Matt, but I don't mind either way.

The main thing that changed since the publication of this PR: originally, the model compared cosine similarity between the context (sentence) and entity (description) encoders, but now there's an additional EL model that takes the output of the context & entity encoders plus additional features such as prior probability and NER type, and runs a net on top of that. This is a more generic setup and allows for more flexibility, but it's a little awkward that it currently results in two models being trained in the EntityLinker pipe: the context encoder (Tok2Vec) and the actual EL model. I wasn't sure how to define the optimizer for the context encoder, so right now it's a field sgd_context in EntityLinker. Ideally I would have chained the context encoder into the EL model, which would also make the backprop easier, but I'm not sure how to do that here because its output first needs to be concatenated with an encoder that uses a different input.
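As a rough sketch of that setup (not the actual spaCy/Thinc code; all names here are illustrative): the EL model scores the concatenation of the context encoding, the entity encoding, and scalar features, with a single linear layer standing in for the net on top.

```python
def el_features(context_vec, entity_vec, prior_prob, ner_type_match):
    # Concatenate everything into one input vector for the EL model:
    # context encoding + entity encoding + scalar features.
    return list(context_vec) + list(entity_vec) + [prior_prob, float(ner_type_match)]

def score(features, weights, bias=0.0):
    # A single linear layer as a stand-in for the EL net.
    return sum(f * w for f, w in zip(features, weights)) + bias

feats = el_features([0.1, 0.2], [0.3, 0.4], prior_prob=0.8, ner_type_match=True)
print(len(feats))  # 2 + 2 + 2 = 6
```

The awkwardness described above comes from the left half of this concatenation (the context encoding) having its own trainable model with a different input than the rest.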

@ines commented Jul 10, 2019

Okay, I'll go ahead and merge this 🎉 It'll still be unofficial and undocumented, but it'd allow people who want to test it to do so. We can then keep adding to it until it's ready for "official" release 🙂

@ines ines merged commit 6ba5ddb into explosion:master Jul 10, 2019
@svlandeg svlandeg deleted the feature/nel-wiki branch July 10, 2019 15:34
@svlandeg svlandeg mentioned this pull request Aug 1, 2019
@zainbnv commented Dec 16, 2019

I have successfully trained a spaCy entity linking model (by limiting the data, obviously).

My question is: how do I display the description of an entity from the KB as output?

import spacy
nlp = spacy.load(r"D:\el model\nlp")
doc = nlp("Amir Khan is a great boxer")
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)

@svlandeg

Hi @zainbnv. The descriptions are currently not stored in the KB itself for performance reasons. However, among the intermediate results of processing, you should have a file entity_descriptions.csv that maps each Wikidata ID to its description in a simple tabular format.
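If that intermediate file is available, a lookup could be sketched like this. The delimiter and header used below are assumptions for the example (an in-memory sample stands in for the real file); check the actual output of your processing run.

```python
import csv
import io

# In-memory stand-in for entity_descriptions.csv; the "|" delimiter and
# the header names here are assumptions, not a documented format.
sample = "WD_id|description\nQ42|English author and humorist\n"

descriptions = {}
reader = csv.reader(io.StringIO(sample), delimiter="|")
next(reader)  # skip the header row
for wd_id, desc in reader:
    descriptions[wd_id] = desc

print(descriptions["Q42"])  # English author and humorist
```

With such a dict in hand, the loop over `doc.ents` could print `descriptions.get(e.kb_id_, "")` alongside the text and label.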

Labels: enhancement (Feature requests and improvements), feat / nel (Feature: Named Entity linking), feat / ner (Feature: Named Entity Recognizer)