💫 Entity Linking in spaCy #3339
Comments
This sounds really exciting! I'm curious how this relates to a task I've used spaCy for in the past (others may have too). The use-case is that you're a user with a small KB: a set of entities (possibly with aliases) that you want to link text to. Currently, you can roll your own system using the existing NER with rule-based patches when required, then matching and ranking candidates. But if you already had a big Wikipedia model available for linking, I can imagine wanting to try to match into the small KB and then back off to Wikipedia (or is there some weird KB-composition operation???), ideally with the same interfaces. This kind of thing is almost certainly an MVP non-goal, but I'm interested to see if it's something the team is thinking about.
Is there any plan to integrate this with Wikidata?
@turbolent : as our focus will be on linking to Wikipedia in the first phases, I think an integration with Wikidata will come naturally. There are crosslinks between the two anyway, and I've seen some prior work where the Wikidata knowledge graph was used to tune the prior probabilities for P(entity|alias). So yes, I think it's an important resource to exploit.
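For concreteness, a prior like P(entity|alias) can be estimated simply by counting how often each alias links to each entity, e.g. over Wikipedia anchor texts. A minimal sketch — the toy link data and counts below are illustrative, not real statistics:

```python
from collections import Counter, defaultdict

def estimate_priors(links):
    """Estimate P(entity|alias) from observed (alias, entity) link pairs,
    e.g. harvested from Wikipedia anchor texts."""
    counts = defaultdict(Counter)
    for alias, entity in links:
        counts[alias][entity] += 1
    return {
        alias: {e: c / sum(ec.values()) for e, c in ec.items()}
        for alias, ec in counts.items()
    }

# Toy data (hypothetical counts): anchor text -> Wikidata QID it linked to
links = [
    ("Washington", "Q61"),  # Washington, D.C.
    ("Washington", "Q61"),
    ("Washington", "Q61"),
    ("Washington", "Q23"),  # George Washington
    ("Douglas Adams", "Q42"),
]
priors = estimate_priors(links)
# priors["Washington"] -> {"Q61": 0.75, "Q23": 0.25}
```

Tuning these priors with the Wikidata knowledge graph, as mentioned above, would then amount to reweighting the counts before normalizing.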
@wejradford : interesting use-case. So basically you'd want to link to your small KB and have WP linking as a sort of fallback? But then would you want to keep track of two different knowledge DBs, or would you somehow unify/merge them?
Really cool! Wondering about point 3: wouldn't it be easier to link non-English text to non-English Wikipedia, and then use language links from that non-English Wikipedia to jump to the English Wikipedia? It would have the added benefit of being able to (also) use the non-English Wikipedia links.
@anneschuth I'd rather have a canonical knowledge base with potentially language-specific feature vectors. That feels a little bit more generic and less Wikipedia-specific to me. I think it also makes sense to do the KB reconciliation once, as an offline data-dependent step, and not mix it into the runtime.
@svlandeg that's a good question. For the use-cases I'm familiar with, I wouldn't expect spaCy to merge them. For example, I might want to pull out general entities from Wikipedia, but also some specific things like less-prominent entities of the usual type, or perhaps commands/intents if it's a dialogue case.
@wejradford : gotcha. I think we were sort of assuming one KB and one EL component per pipeline, but it's an interesting use-case to think about...
Cool feature! +1 for linking non-English text to non-English Wikipedia and then optionally exploiting Wikipedia's cross-language links. The estimation of P(entity|alias) makes a lot more sense to me if the aliases are in the same language as the mentions. Plus, I expect the coverage of entities for a document collection in a specific language to be higher in the corresponding non-English Wikipedia than in the English Wikipedia.
The downside is of course that any non-English WP is significantly smaller than the English one (https://meta.wikimedia.org/wiki/List_of_Wikipedias). Of course you can still use the interwiki links to go from the overlapping subset of WP:EN to the set of links available in the other language. I do agree with the idea that exploiting non-English WP links would be useful, and the prior probabilities & candidate generation for XEL will certainly be an interesting task to tackle...
By the way, the XELMS paper by Upadhyay et al. in EMNLP'18 has some interesting results around the topic of cross-lingual entity linking. Basically, they train an XEL model using multilingual supervision, exploiting both the richer content in WP:EN as well as language-sensitive information in the target language. The paper has some interesting experiments (e.g. Table 3) comparing to prior work, as well as comparing their system using either monolingual or joint supervision.
So, this will basically make disambiguation easier?
We're aiming for an in-memory implementation using a Cython backend, so you'll probably have to convert your triplet store to the spaCy KB structure using APIs that we'll make available for adding entities, aliases and prior probabilities. Would that work for you?
That would be great. Currently I am "converting" them to the phrase matcher, so it will probably be no issue.
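To make that conversion concrete, here is a toy sketch of a KB with the three ingredients mentioned above — entities, aliases, and prior probabilities. This is not spaCy's actual class; all names, QIDs, and numbers below are illustrative:

```python
class SimpleKB:
    """Toy in-memory KB (not spaCy's real implementation) holding
    entities, aliases, and per-alias prior probabilities."""

    def __init__(self):
        self.entities = {}  # entity_id -> frequency
        self.aliases = {}   # alias -> list of (entity_id, prior)

    def add_entity(self, entity_id, freq=0):
        self.entities[entity_id] = freq

    def add_alias(self, alias, entities, probabilities):
        if sum(probabilities) > 1.0 + 1e-9:
            raise ValueError("priors for one alias cannot sum to more than 1")
        self.aliases[alias] = list(zip(entities, probabilities))

    def get_candidates(self, alias):
        return self.aliases.get(alias, [])

# Converting a triplet store would boil down to calls like these:
kb = SimpleKB()
kb.add_entity("Q42", freq=120)   # Douglas Adams
kb.add_entity("Q5284", freq=90)  # Bill Gates
kb.add_alias("Adams", ["Q42"], [0.9])
```

Candidate lookup at runtime is then a single `get_candidates(alias)` call against the in-memory structure.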
You should definitely take a look at this to encode context and Wikipedia articles. And this for some great state of the art: https://github.com/openai/deeptype
@turbolent @anneschuth @dodijk : So we're thinking about centering the initial work around Wikidata specifically. This would ensure we have the same KB across languages and would support cross-lingual linking in a (hopefully) more straightforward fashion. An additional advantage is that the IDs are more stable (WP titles can change). Also, Wikidata has much more coverage (WP:EN has 5.8M pages, Wikidata has 55M entities). For example, @honnibal doesn't seem to have a WP page but does have a Wikidata entry.
Planning to do the annotations using Wikidata is great! The Wikidata canonical URIs are like http://www.wikidata.org/entity/Q47153978 |
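Since the canonical URIs embed the QID as the last path segment, a stable identifier can be pulled out with a one-liner (a purely illustrative helper, not part of any library):

```python
def qid_from_uri(uri: str) -> str:
    """Extract the Wikidata QID from a canonical entity URI,
    e.g. "http://www.wikidata.org/entity/Q47153978" -> "Q47153978"."""
    return uri.rstrip("/").rsplit("/", 1)[-1]

qid = qid_from_uri("http://www.wikidata.org/entity/Q47153978")
# qid == "Q47153978"
```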
Hi @svlandeg - I'm one of the authors of scispaCy. Cool to see you were thinking of using it in your stage 6 design doc. We've actually begun to do a bit of work on this ourselves (nothing fancy, mainly just some string matching / sklearn classifier type of approaches), and we have a knowledge base (a filtered version of the Unified Medical Language System, such that we can distribute it). It has 2.78M concepts (down from 3.3M concepts in the full UMLS release), and it covers 99% of the entities in the MedMentions dataset. We'd be happy to help out, either by testing out components as you go, or by implementing the entity linking system you land on for biomedical text and seeing what goes wrong. The MedMentions data is very nice and easy to work with - it's another option to consider, as well as the gene databases that you have listed in your presentation.
Hi @DeNeutoy ! Nice work on scispaCy :-) We're currently focusing on getting the general architecture & APIs in place that should allow you to connect any KB and EL model into the usual spaCy pipeline. And it would definitely be great to collaborate on getting the infrastructure to work for the biomedical domain, BioNLP being my first true love ;-)
I am really looking forward to this feature - I've needed it in the past.
PR #3459 : first general framework, APIs etc., using a simple dummy algorithm for now. All feedback welcome!
Hi @svlandeg, One potential solution is featurizing entity names in the KB and the text span, then using cosine similarity to generate candidates. The features could be sparse (I tried char-n-grams for scispaCy entity linking and they work well) and/or dense (embeddings for the text; it would be nice if the embeddings are character-based, not word-based). It is also possible to have the featurizing function be an input to the NEL model. With a KB of millions of candidates, you will also need some form of fast approximate nearest neighbors. Another improvement is having the features weighted before computing cosine similarities. I tried simple tf-idf weighting of the char-n-gram features, but a fancier solution is learning feature weights as in Chen and Van Durme, EACL 2017 (https://github.com/ctongfei/probe).
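A minimal sketch of the char-n-gram idea from the comment above, using raw n-gram counts and cosine similarity (tf-idf weighting and approximate nearest neighbors would be the next steps; the entity names below are toy data):

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, with padding so word boundaries count."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_candidates(mention, kb_names):
    """Score every KB name against the mention, best first."""
    m = char_ngrams(mention)
    return sorted(((name, cosine(m, char_ngrams(name))) for name in kb_names),
                  key=lambda pair: pair[1], reverse=True)

names = ["Douglas Adams", "Douglas Hofstadter", "John Adams"]
ranked = rank_candidates("D. Adams", names)
```

With millions of names, this exhaustive scoring would be replaced by an approximate-nearest-neighbor index over the same feature vectors, as the comment suggests.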
Hi @ibeltagy : thanks for the pointers and ideas! Definitely worth considering, you're right that prior probabilities are not always easy to come by, so we'll have to make sure that there are viable alternatives.
@svlandeg, nice project! Loading a large KB into memory can be quite time consuming. Do you have any ideas regarding speeding up loading?
It's sort of next-up on the TODO list to try and load a significant part of WikiData in memory and see how that goes ;-)
Until spaCy includes entity linking, what would be the next-best system that is free and can be used out of the box? Are there usable solutions based on OpenAI's DeepType paper? Or wikipedia2vec? It seems to me that a joint NER and EL solution would be way better than any system that is strictly sequential and attempts to solve NER before linking entities. I have a few other solutions on my list as well.
A new paper about entity linking with Wikidata that might be relevant: OpenTapioca: Lightweight Entity Linking for Wikidata
@Tpt : it's a tempting idea to not have to rely on WP data, but I think it comes with serious limitations, too. As the authors point out, there's no good way to get prior probabilities, and the aliases you obtain are somewhat artificially clean. For instance, using WP links and coreference resolution, you could find a whole bunch of candidate entities for "The president", while this particular mention is probably too vague to be added to Wikidata. But the vagueness of it is realistic in actual texts. I like the approach of exploiting the Wikidata knowledge graph to improve semantic similarity between the entities though!
As a quick update and also in reply to @koenvanderveen: We've written a custom reader/writer to store the KB (a Cython data structure) on file and read the entries back in in bulk.
Any updates on this? I'm probably most interested in linking ORG entities. It would be amazing if spaCy could cover the problem domain of entity linking as well.
@pwichmann : yep - we're definitely in full swing working on this. We're currently designing and testing the neural net to train the entity linker. The current setup roughly consists of two steps: candidate generation, followed by candidate ranking. Each candidate+mention pair is run through the network and a probability is obtained as to how likely they match. This output is then combined with the prior probabilities of the candidates to obtain a final score for each pair.
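The comment above doesn't spell out the exact combination function, so as an illustration, one simple choice is a linear interpolation between the model's probability and the candidate's prior (the mixing weight `alpha` and all numbers below are made up):

```python
def final_score(p_model, p_prior, alpha=0.5):
    """Combine the network's match probability with the candidate's
    prior; alpha is a hypothetical mixing weight, not spaCy's actual one."""
    return alpha * p_model + (1 - alpha) * p_prior

# Toy numbers: (context-model probability, prior probability) per candidate
candidates = {
    "Q61": (0.30, 0.75),  # strong prior, weak context match
    "Q23": (0.90, 0.25),  # weak prior, strong context match
}
scores = {qid: final_score(pm, pp) for qid, (pm, pp) in candidates.items()}
best = max(scores, key=scores.get)
# best == "Q23" (0.575 vs 0.525): the context signal outweighs the prior here
```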
If spaCy (or its users) plan to link to Wikipedia, be aware that Wikipedia URLs can be unstable over time. If you need to store links persistently, I've found it's best to use Wikidata IDs and then resolve these to Wikipedia URLs as late as possible.
@jcnewell : yep, that's exactly what we'll be doing - the knowledge base will be centered around Wikidata IDs.
Will it be possible to link to your own database instead of Wikipedia?
@tsoernes : that should be possible, but then you'll have to plug in your own KB and your own training data.
Hello! About free open-source tools: you can add entity-fishing to the list, which does Wikidata entity recognition and disambiguation for 5 languages, with many options, and is not restricted to NER (it can be restricted to NER, of course). It does that at scale... since 2017... I think I failed a bit on the communication side for this tool :) I've started to add a DeepType implementation at some point (with a BidLSTM-CRF model for final typing), but it's not progressing a lot due to other ongoing projects.
One piece of info that you might find useful: the way I managed to efficiently use millions of entities, vocabularies, links, statistics, and word and entity embeddings in different languages was to use LMDB as an embedded database. Basically there is no loading; all the resources are immediately "warm", and I could get up to 600K multithreaded accesses per second (with an SSD). It was in Java, but the Python LMDB binding is very good and robust too.
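LMDB itself requires the third-party `lmdb` binding, so as a stand-in the sketch below uses the standard library's `dbm` module to show the same pattern: an embedded, file-backed key-value store that is "warm" as soon as it's opened, with no bulk loading step at startup. The record content is made up:

```python
import dbm
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "kb")

# Build phase (offline, once): write entity records keyed by QID.
with dbm.open(path, "c") as db:
    db["Q42"] = json.dumps({"name": "Douglas Adams", "freq": 120})

# Query phase (at startup): just open and look up - nothing to preload.
with dbm.open(path, "r") as db:
    record = json.loads(db["Q42"])
# record["name"] == "Douglas Adams"
```

The real LMDB gives the same open-and-go behavior plus memory-mapped reads and safe concurrent access, which is where the throughput numbers above come from.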
If I follow your design doc and this thread correctly: you obtain the final score for each candidate+mention pair, apply a confidence threshold, and get a link or NIL for each mention; context is encoded by the sentence encoder and the article encoder. Interested in your thoughts on how good a baseline it could be.
I think we can close this now 🎉 . We're still working on better models, but the functionality is all shipped.
Feature description
Together with @honnibal & @ines, we have been discussing adding an Entity Linking module to spaCy. This module would run on top of NER results and disambiguate & link tagged mentions to a knowledge base. We are thinking of implementing this in a few different phases.
Notes
As some prior research, we compiled some notes on this project & its requirements: https://drive.google.com/file/d/1UYnPgx3XjhUx48uNNQ3kZZoDzxoinSBF. This contains more details on the EL components and implementation phases.
Feedback requested
We will start implementing the APIs soon, but first we would love to hear your ideas, suggestions, and requests with respect to this new functionality!