Vectors.most_similar() raises ValueError when query vectors return different num matches #5320
Comments
Thanks for the report and the detailed analysis! Looks like a bug to me, and something we should definitely investigate further. Any chance you have a small reproducible code snippet (with a mockup vocab maybe?) that triggers this error? That would help us dig into this faster :-)
Hi @svlandeg, I came up with a (very haphazard) example that raises this error:

import gensim
import numpy as np
import spacy

lang = "en"
embed_size = 100
texts = [
    "Have you listened to the new Fiona Apple album yet?",
    "I've had it on repeat since yesterday, and wow, it's so so great.",
    "Almost makes the 8-year wait worth it!",
]
spacy_lang = spacy.blank(lang)
docs = spacy_lang.pipe(texts)
sents = [[tok.text for tok in doc] for doc in docs]
# generating custom fasttext word embedding vectors
ft = gensim.models.fasttext.FastText(
    sentences=sents,
    size=embed_size,
    min_count=1,
    window=5,
    iter=5,
)
# reset vectors on vocab object w/ desired embedding size
# see: https://spacy.io/usage/vectors-similarity#custom
spacy_lang.vocab.reset_vectors(width=embed_size)
for word in ft.wv.vocab:
    spacy_lang.vocab.set_vector(word, ft.wv[word])
query_vectors = np.asarray([spacy_lang.vocab.get_vector(word) for word in ["music", "album", "I"]])
keys, _, _ = spacy_lang.vocab.vectors.most_similar(query_vectors, n=5)

Thanks for digging in!
Ah, entertaining bugs. Here
How to reproduce the behaviour
When multiple queries passed in a single call to Vectors.most_similar() return different numbers of results (fewer than the specified n), the function fails with a cryptic numpy exception: ValueError: setting an array element with a sequence.
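For illustration, the same exception can be triggered with plain numpy, independent of spaCy. This is only a minimal sketch: the key values and the pad_rows helper are made up for the example and are not part of spaCy, and the padding at the end is just one conceivable way to get a rectangular array, not necessarily how the library handles it.

```python
import numpy as np

# Hypothetical per-query match lists: with n=3, the second query only
# found two matches, so the rows have different lengths.
ragged_keys = [[11, 12, 13], [21, 22]]

message = None
try:
    # Building a fixed-dtype array from rows of unequal length fails.
    np.asarray(ragged_keys, dtype="uint64")
except ValueError as err:
    # The message begins with "setting an array element with a sequence";
    # newer numpy versions append more detail about the shape.
    message = str(err)
    print(message)


def pad_rows(rows, n, fill=0):
    """Pad each row with `fill` up to length n (hypothetical helper)."""
    return [list(row) + [fill] * (n - len(row)) for row in rows]


# Once every row has length n, the conversion succeeds.
padded = np.asarray(pad_rows(ragged_keys, n=3), dtype="uint64")
print(padded.shape)
```

Whether padding with a sentinel is acceptable depends on how callers interpret the returned keys, so treat it as an assumption rather than a recommendation.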
Apparently this is raised when you try to create an array from lists of different lengths.
I think these lines are causing it (https://github.com/explosion/spaCy/blob/master/spacy/vectors.pyx#L361-L363), but I can't convince my debugger to dig into the Cython. Here's some further evidence:
It's possible that this is just a weird edge case, since I'm populating my vocab / vectors table from scratch using a relatively small corpus (1k docs). But maybe this is a realistic issue for the pre-trained vocab/vectors when n is large.
Your Environment