💫 Entity Linking in spaCy #3339
Comments
This sounds really exciting! I'm curious how this relates to a task I've used spaCy for in the past (others may have too). The use-case is that you're a user with a small KB: a set of entities (possibly with aliases) that you want to link text to. Currently, you can roll your own system using the existing NER with rule-based patches when required, then matching and ranking candidates. But if you already had a big Wikipedia model available for linking, I can imagine wanting to try to match into the small KB and then back off to Wikipedia (or is there some weird KB-composition operation???), ideally with the same interfaces. This kind of thing is almost certainly an MVP non-goal, but I'm interested to see if it's something the team is thinking about.
Is there any plan to integrate this with Wikidata?
@turbolent : as our focus will be on linking to Wikipedia in the first phases, I think an integration with Wikidata will come naturally. There are crosslinks between the two anyway, and I've seen some prior work where the Wikidata knowledge graph was used to tune the prior probabilities for P(entity|alias). So yes, I think it's an important resource to exploit.
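For concreteness, a prior like P(entity|alias) can be estimated simply by counting how often each alias links to each entity, e.g. over Wikipedia anchor texts. A minimal sketch — the toy link data and counts below are illustrative, not real statistics:

```python
from collections import Counter, defaultdict

def estimate_priors(links):
    """Estimate P(entity|alias) from observed (alias, entity) link pairs,
    e.g. harvested from Wikipedia anchor texts."""
    counts = defaultdict(Counter)
    for alias, entity in links:
        counts[alias][entity] += 1
    return {
        alias: {e: c / sum(ec.values()) for e, c in ec.items()}
        for alias, ec in counts.items()
    }

# Toy data (hypothetical counts): anchor text -> Wikidata QID it linked to
links = [
    ("Washington", "Q61"),  # Washington, D.C.
    ("Washington", "Q61"),
    ("Washington", "Q61"),
    ("Washington", "Q23"),  # George Washington
    ("Douglas Adams", "Q42"),
]
priors = estimate_priors(links)
# priors["Washington"] -> {"Q61": 0.75, "Q23": 0.25}
```

Tuning these priors with the Wikidata knowledge graph, as mentioned above, would then amount to reweighting the counts before normalizing.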
@wejradford : interesting use-case. So basically you'd want to link to your small KB and have WP linking as a sort of fallback? But then would you want to keep track of two different knowledge DBs, or would you somehow unify/merge them?
Really cool! Wondering about point 3: wouldn't it be easier to link non-English text to non-English Wikipedia, and then use language links from that non-English Wikipedia to jump to the English Wikipedia? It would have the added benefit of being able to (also) use the non-English Wikipedia links.
@anneschuth I'd rather have a canonical knowledge base with potentially language-specific feature vectors. That feels a little bit more generic and less Wikipedia-specific to me. I think it also makes sense to do the KB reconciliation once, as an offline data-dependent step, and not mix it into the runtime.
@svlandeg that's a good question. For the use-cases I'm familiar with, I wouldn't expect spaCy to merge them. For example, I might want to pull out general entities from Wikipedia, but also some specific things like less-prominent entities of the usual type, or perhaps commands/intents if it's a dialogue case.
@wejradford : gotcha. I think we were sort of assuming one KB and one EL component per pipeline, but it's an interesting use-case to think about...
Cool feature! +1 for linking non-English text to non-English Wikipedia and then optionally exploiting Wikipedia's cross-language links. The estimation of P(entity|alias) makes a lot more sense to me if the aliases are in the same language as the mentions. Plus, I expect the coverage of entities for a document collection in a specific language to be higher in the corresponding non-English Wikipedia than in the English Wikipedia.
The downside is of course that any non-English WP is significantly smaller than the English one (https://meta.wikimedia.org/wiki/List_of_Wikipedias). Of course you can still use the interwiki links to go from the overlapping subset of WP:EN to the set of links available in the other language. I do agree with the idea that exploiting non-English WP links would be useful, and the prior probabilities & candidate generation for XEL will certainly be an interesting task to tackle...
By the way, the XELMS paper by Upadhyay et al. in EMNLP'18 has some interesting results around the topic of cross-lingual entity linking. Basically, they train an XEL model using multilingual supervision, exploiting both the richer content in WP:EN as well as language-sensitive information in the target language. The paper has some interesting experiments (e.g. Table 3) comparing to prior work, as well as comparing their system using either monolingual or joint supervision.
So, this will basically make disambiguation easier?
We're aiming for an in-memory implementation using a Cython backend, so you'll probably have to convert your triplet store to the spaCy KB structure using APIs that we'll make available for adding entities, aliases and prior probabilities. Would that work for you?
That would be great. Currently I am "converting" them to the phrase matcher, so it will probably be no issue.
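To make that conversion concrete, here is a toy sketch of a KB with the three ingredients mentioned above — entities, aliases, and prior probabilities. This is not spaCy's actual class; all names, QIDs, and numbers below are illustrative:

```python
class SimpleKB:
    """Toy in-memory KB (not spaCy's real implementation) holding
    entities, aliases, and per-alias prior probabilities."""

    def __init__(self):
        self.entities = {}  # entity_id -> frequency
        self.aliases = {}   # alias -> list of (entity_id, prior)

    def add_entity(self, entity_id, freq=0):
        self.entities[entity_id] = freq

    def add_alias(self, alias, entities, probabilities):
        if sum(probabilities) > 1.0 + 1e-9:
            raise ValueError("priors for one alias cannot sum to more than 1")
        self.aliases[alias] = list(zip(entities, probabilities))

    def get_candidates(self, alias):
        return self.aliases.get(alias, [])

# Converting a triplet store would boil down to calls like these:
kb = SimpleKB()
kb.add_entity("Q42", freq=120)   # Douglas Adams
kb.add_entity("Q5284", freq=90)  # Bill Gates
kb.add_alias("Adams", ["Q42"], [0.9])
```

Candidate lookup at runtime is then a single `get_candidates(alias)` call against the in-memory structure.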
You should definitely take a look at this to encode context and Wikipedia articles. And this for some great state of the art: https://github.com/openai/deeptype
@turbolent @anneschuth @dodijk : So we're thinking about centering the initial work around Wikidata specifically. This would ensure we have the same KB across languages and would support cross-lingual linking in a (hopefully) more straightforward fashion. An additional advantage is that the IDs are more stable (WP titles can change). Also, Wikidata has much more coverage (WP:EN has 5.8M pages, Wikidata has 55M entities). For example, @honnibal doesn't seem to have a WP page but does have a Wikidata entry.
Planning to do the annotations using Wikidata is great! The Wikidata canonical URIs are like http://www.wikidata.org/entity/Q47153978 |
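Since the canonical URIs embed the QID as the last path segment, a stable identifier can be pulled out with a one-liner (a purely illustrative helper, not part of any library):

```python
def qid_from_uri(uri: str) -> str:
    """Extract the Wikidata QID from a canonical entity URI,
    e.g. "http://www.wikidata.org/entity/Q47153978" -> "Q47153978"."""
    return uri.rstrip("/").rsplit("/", 1)[-1]

qid = qid_from_uri("http://www.wikidata.org/entity/Q47153978")
# qid == "Q47153978"
```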
Hi @svlandeg - I'm one of the authors of scispaCy. Cool to see you were thinking of using it in your stage 6 design doc. We've actually begun to do a bit of work on this ourselves (nothing fancy, mainly just some string matching / sklearn classifier type of approaches), and we have a knowledge base (a filtered version of the Unified Medical Language System, such that we can distribute it). It has 2.78M concepts (down from 3.3M concepts in the full UMLS release), and it covers 99% of the entities in the MedMentions dataset. We'd be happy to help out, either by testing out components as you go, or by implementing the entity linking system you land on for biomedical text and seeing what goes wrong. The MedMentions data is very nice and easy to work with - it's another option to consider, as well as the gene databases that you have listed in your presentation.
Hi @DeNeutoy ! Nice work on scispaCy :-) We're currently focusing on getting the general architecture & APIs in place that should allow you to connect any KB and EL model into the usual spaCy pipeline. And it would definitely be great to collaborate on getting the infrastructure to work for the biomedical domain, BioNLP being my first true love ;-)
I am really looking forward to this feature - I've needed it in the past.
PR #3459 : first general framework, APIs etc., using a simple dummy algorithm for now. All feedback welcome!
Hi @svlandeg, One potential solution is featurizing entity names in the KB and the text span, then using cosine similarity to generate candidates. The features could be sparse (I tried char-n-grams for scispaCy entity linking and they work well) and/or dense (embeddings for the text; it would be nice if the embeddings are character-based, not word-based). It is also possible to have the featurizing function be an input to the NEL model. With a KB of millions of candidates, you will also need some form of fast approximate nearest neighbors. Another improvement is having the features weighted before computing cosine similarities. I tried simple tf-idf weighting of the char-n-gram features, but a fancier solution is learning feature weights as in Chen and Van Durme, EACL 2017 (https://github.com/ctongfei/probe).
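A minimal sketch of the char-n-gram idea from the comment above, using raw n-gram counts and cosine similarity (tf-idf weighting and approximate nearest neighbors would be the next steps; the entity names below are toy data):

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, with padding so word boundaries count."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_candidates(mention, kb_names):
    """Score every KB name against the mention, best first."""
    m = char_ngrams(mention)
    return sorted(((name, cosine(m, char_ngrams(name))) for name in kb_names),
                  key=lambda pair: pair[1], reverse=True)

names = ["Douglas Adams", "Douglas Hofstadter", "John Adams"]
ranked = rank_candidates("D. Adams", names)
```

With millions of names, this exhaustive scoring would be replaced by an approximate-nearest-neighbor index over the same feature vectors, as the comment suggests.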
Hi @ibeltagy : thanks for the pointers and ideas! Definitely worth considering, you're right that prior probabilities are not always easy to come by, so we'll have to make sure that there are viable alternatives.
@svlandeg, nice project! Loading a large KB into memory can be quite time consuming. Do you have any ideas regarding speeding up loading?
It's sort of next-up on the TODO list to try and load a significant part of WikiData in memory and see how that goes ;-)
Until spaCy includes entity linking, what would be the next-best system that is free and can be used out of the box? Are there usable solutions based on OpenAI's DeepType paper? Or wikipedia2vec? It seems to me that a joint NER and EL solution would be way better than any system that is strictly sequential and attempts to solve NER before linking entities. I have a few other solutions on my list as well.
A new paper about entity linking with Wikidata that might be relevant: OpenTapioca: Lightweight Entity Linking for Wikidata
@Tpt : it's a tempting idea to not have to rely on WP data, but I think it comes with serious limitations, too. As the authors point out, there's no good way to get prior probabilities, and the aliases you obtain are somewhat artificially clean. For instance, using WP links and coreference resolution, you could find a whole bunch of candidate entities for "The president", while this particular mention is probably too vague to be added to Wikidata. But the vagueness of it is realistic in actual texts. I like the approach of exploiting the Wikidata knowledge graph to improve semantic similarity between the entities though!
As a quick update and also in reply to @koenvanderveen: We've written a custom reader/writer to store the KB (a Cython data structure) on file and read the entries back in in bulk.
Any updates on this? I'm probably most interested in linking ORG entities. It would be amazing if spaCy could cover the problem domain of entity linking as well.
@pwichmann : yep - we're definitely in full swing working on this. We're currently designing and testing the neural net to train the entity linker. The current setup roughly consists of two steps: candidate generation, followed by candidate ranking. Each candidate+mention pair is run through the network and a probability is obtained as to how likely they match. This output is then combined with the prior probabilities of the candidates to obtain a final score for each pair.
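The comment above doesn't spell out the exact combination function, so as an illustration, one simple choice is a linear interpolation between the model's probability and the candidate's prior (the mixing weight `alpha` and all numbers below are made up):

```python
def final_score(p_model, p_prior, alpha=0.5):
    """Combine the network's match probability with the candidate's
    prior; alpha is a hypothetical mixing weight, not spaCy's actual one."""
    return alpha * p_model + (1 - alpha) * p_prior

# Toy numbers: (context-model probability, prior probability) per candidate
candidates = {
    "Q61": (0.30, 0.75),  # strong prior, weak context match
    "Q23": (0.90, 0.25),  # weak prior, strong context match
}
scores = {qid: final_score(pm, pp) for qid, (pm, pp) in candidates.items()}
best = max(scores, key=scores.get)
# best == "Q23" (0.575 vs 0.525): the context signal outweighs the prior here
```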
If spaCy (or its users) plan to link to Wikipedia, be aware that Wikipedia URLs can be unstable over time. If you need to store links persistently, I've found it's best to use Wikidata IDs and then resolve these to Wikipedia URLs as late as possible.
@jcnewell : yep, that's exactly what we'll be doing - the knowledge base will be centered around Wikidata IDs.
Will it be possible to link to your own database instead of Wikipedia?
@tsoernes : that should be possible, but then you'll have to plug in your own KB and your own training data.
Hello! About free open-source tools: you can add entity-fishing to the list, which does Wikidata entity recognition and disambiguation for 5 languages, with many options, and is not restricted to NER (it can be restricted to NER, of course). It does that at scale... since 2017... I think I failed a bit on the communication side for this tool :) I've started to add a DeepType implementation at some point (with a BidLSTM-CRF model for final typing), but it's not progressing a lot due to other ongoing projects.
One piece of info that you might find useful: the way I managed to efficiently use millions of entities, vocabularies, links, statistics, and word and entity embeddings in different languages was to use LMDB as an embedded database. Basically there is no loading; all the resources are immediately "warm", and I could get up to 600K multithreaded accesses per second (with an SSD). It was in Java, but the Python LMDB binding is very good and robust too.
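LMDB itself requires the third-party `lmdb` binding, so as a stand-in the sketch below uses the standard library's `dbm` module to show the same pattern: an embedded, file-backed key-value store that is "warm" as soon as it's opened, with no bulk loading step at startup. The record content is made up:

```python
import dbm
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "kb")

# Build phase (offline, once): write entity records keyed by QID.
with dbm.open(path, "c") as db:
    db["Q42"] = json.dumps({"name": "Douglas Adams", "freq": 120})

# Query phase (at startup): just open and look up - nothing to preload.
with dbm.open(path, "r") as db:
    record = json.loads(db["Q42"])
# record["name"] == "Douglas Adams"
```

The real LMDB gives the same open-and-go behavior plus memory-mapped reads and safe concurrent access, which is where the throughput numbers above come from.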
If I follow your design doc and this thread correctly: you obtain the final score for each candidate+mention pair, apply a confidence threshold, and get a link or NIL for each mention; context is encoded by the sentence encoder and the article encoder. Interested in your thoughts on how good a baseline it could be.
I think we can close this now 🎉 . We're still working on better models, but the functionality is all shipped.
Feature description
Together with @honnibal & @ines, we have been discussing adding an Entity Linking module to spaCy. This module would run on top of NER results and disambiguate & link tagged mentions to a knowledge base. We are thinking of implementing this in a few different phases.
Notes
As some prior research, we compiled some notes on this project & its requirements: https://drive.google.com/file/d/1UYnPgx3XjhUx48uNNQ3kZZoDzxoinSBF. This contains more details on the EL components and implementation phases.
Feedback requested
We will start implementing the APIs soon, but first we would love to hear your ideas, suggestions, and requests with respect to this new functionality!