Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span __hash__ and __eq__ are incompatible #4965

Closed
Evpok opened this issue Feb 2, 2020 · 5 comments · Fixed by #5005
Closed

Span __hash__ and __eq__ are incompatible #4965

Evpok opened this issue Feb 2, 2020 · 5 comments · Fixed by #5005
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects

Comments

@Evpok
Copy link

Evpok commented Feb 2, 2020

From __hash__ doc

The only required property is that objects which compare equal have the same hash value

While right now

import spacy
nlp = spacy.load("fr_core_news_sm")
d = nlp("Le succès de General Electric n'est dépassé que par celui de France Télécom")

Gives

>>> d[3:5] == d.ents[0]
True
>>> hash(d[3:5]) == hash(d.ents[0])
False

I understand the rationale here: it is convenient to be able to check that Spans effectively cover the same spans, while Spans bearing different annotations are conceptually different, but it makes functions that depend on hash fail in counter-intuitive ways such as

>>> d[3:5] in d.ents
True
>>> d[3:5] in set(d.ents)
False

Or the specific issue that made me notice this

>>> ent_dict = {e: e.label_ for e in d.ents}
>>> ent_dict[d[3:5]]
KeyError: General Electric
@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects labels Feb 3, 2020
@adrianeboyd
Copy link
Contributor

That's a very charitable explanation for everything! I think it's more likely an oversight that the comparison functions weren't updated when more attributes were added.

Since it's deprecated, we might want to replace the __richcmp__ function with the typical python __le__, etc. functions and consider all the span attributes in the hash and comparisons, like the newer kb_id, too. In any case, __hash__ and __eq__ should definitely do the same thing here.

Off the top of my head I'm not totally sure how to compare labels or kb ids, e.g.:

Span(doc, 0, 5, label=1) < Span(doc, 0, 5, label=2) == ???

But __eq__ should be easy to get right.

@Evpok
Copy link
Author

Evpok commented Feb 3, 2020

But __eq__ should be easy to get right.

In that case, is there an easy replacement to check that two Span well… span the same tokens?

@adrianeboyd
Copy link
Contributor

I take what I said about __eq__ back a bit. There's quite a bit to compare if you really want to compare Span objects properly. From the init:

Doc doc, int start, int end, label=0, vector=None, vector_norm=None, kb_id=0

I don't think there's an easy replacement, but if __eq__ and __hash__ don't compare everything, it's pretty weird behavior for an object. You can even currently compare spans from different docs and they're equal if the offsets are the same, which is definitely not what anybody should want.

I think you might have to just write a utility function to compare the offsets if that's what you're interested in. I'd have to think about whether there's an elegant way to add this to the library.

@Evpok
Copy link
Author

Evpok commented Feb 3, 2020

I think you might have to just write a utility function to compare the offsets if that's what you're interested in

Not cross-document, even if that might make some sense if you process the same input and can guarantee that the tokenization is the same but that seems to specific. So I think either a token-by-token comparison or checking that the parent Doc is the same and that the offsets match?

@lock
Copy link

lock bot commented Mar 17, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants