Vectors.most_similar() raises ValueError when query vectors return different num matches #5320

Closed
bdewilde opened this issue Apr 16, 2020 · 4 comments · Fixed by #5348
Labels
bug Bugs and behaviour differing from documentation feat / vectors Feature: Word vectors and similarity

@bdewilde

How to reproduce the behaviour

When multiple queries passed in a single call to Vectors.most_similar() return different numbers of results (some of them fewer than the specified n), the function fails with a cryptic numpy exception: ValueError: setting an array element with a sequence. Apparently this is raised when you try to create an array from lists of different lengths:

>>> np.array([[1, 2], [1, 2, 3]], dtype="int64")
ValueError                                Traceback (most recent call last)
<ipython-input-86-2d17c17be6b1> in <module>
----> 1 np.array([[1, 2], [1, 2, 3]], dtype="int64")

ValueError: setting an array element with a sequence.

I think these lines are causing it (https://github.com/explosion/spaCy/blob/master/spacy/vectors.pyx#L361-L363), but I can't convince my debugger to dig into the Cython. Here's some further evidence:

(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[0]]), n=10)[0].shape
(1, 8)
(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[1]]), n=10)[0].shape
(1, 10)
(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[2]]), n=10)[0].shape
(1, 9)
(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[3]]), n=10)[0].shape
(1, 10)
(Pdb)  vocab.vectors.most_similar(query_vectors, n=10)
*** ValueError: setting an array element with a sequence.

It's possible that this is just a weird edge case, since I'm populating my vocab / vectors table from scratch using a relatively small corpus (1k docs). But it may be a realistic issue for the pre-trained vocab/vectors when n is large.
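
Since the single-query calls in the Pdb session above succeed, one possible stopgap is to query one vector at a time and keep the results in a plain list, so results of different lengths are never stacked into one array. This is only a sketch: the helper name `most_similar_per_query` is hypothetical, and it assumes `most_similar()` returns a `(keys, rows, scores)` tuple of 2D arrays, as the `[0].shape` output above suggests.

```python
import numpy as np


def most_similar_per_query(vectors, queries, n=10):
    """Call vectors.most_similar() once per query vector.

    `vectors` is anything with a spaCy-style most_similar() method
    (hypothetical helper, not part of spaCy). Returns a list with one
    (keys, rows, scores) tuple of 1D arrays per query; the arrays may
    have different lengths from one query to the next.
    """
    results = []
    for query in queries:
        # Wrap the query in a (1, width) batch, as in the Pdb session.
        keys, rows, scores = vectors.most_similar(np.asarray([query]), n=n)
        results.append((keys[0], rows[0], scores[0]))
    return results
```

With the objects from the session above, this would be called as `most_similar_per_query(vocab.vectors, query_vectors, n=10)`. It is slower than a batched call and does not address the underlying cause.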

Your Environment

  • spaCy version: 2.2.4
  • Platform: Darwin-19.3.0-x86_64-i386-64bit
  • Python version: 3.7.4
@svlandeg svlandeg added the feat / vectors Feature: Word vectors and similarity label Apr 18, 2020
@svlandeg
Member

Thanks for the report and the detailed analysis!

Looks like a bug to me, and something we should definitely investigate further.

Any chance you have a small reproducible code snippet (with a mockup vocab maybe?) that triggers this error? That would help us dig into this faster :-)

@svlandeg svlandeg added the bug Bugs and behaviour differing from documentation label Apr 18, 2020
@bdewilde
Author

Hi @svlandeg, I came up with a (very haphazard) example that raises this error:

import gensim
import numpy as np
import spacy

lang = "en"
embed_size = 100
texts = [
    "Have you listened to the new Fiona Apple album yet?",
    "I've had it on repeat since yesterday, and wow, it's so so great.",
    "Almost makes the 8-year wait worth it!",
]

spacy_lang = spacy.blank(lang)
docs = spacy_lang.pipe(texts)
sents = [[tok.text for tok in doc] for doc in docs]
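# generate custom fastText word embedding vectors
# (gensim 3.x names; gensim 4 renamed size/iter to vector_size/epochs
# and wv.vocab to wv.key_to_index)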
ft = gensim.models.fasttext.FastText(
    sentences=sents,
    size=embed_size,
    min_count=1,
    window=5,
    iter=5,
)
# reset vectors on vocab object w/ desired embedding size
# see: https://spacy.io/usage/vectors-similarity#custom
spacy_lang.vocab.reset_vectors(width=embed_size)
for word in ft.wv.vocab:
    spacy_lang.vocab.set_vector(word, ft.wv[word])
    
query_vectors = np.asarray([spacy_lang.vocab.get_vector(word) for word in ["music", "album", "I"]])
keys, _, _ = spacy_lang.vocab.vectors.most_similar(query_vectors, n=5)

Thanks for digging in!

@adrianeboyd
Contributor

Ah, entertaining bugs. Here most_similar is also searching in the empty all-0 padding rows of the internal vectors table (it has some padding so it doesn't have to resize for each new vector, just when it gets full). For "music", which isn't assigned a vector in the model so it gets the default all-0 vector, it returns closest matches out of the all-0 padding rows. These individual matches get filtered out because it knows the rows aren't in use, but then you end up with fewer matches for some queries than others.
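
The mechanism described above can be reproduced with plain NumPy. The following is a toy sketch, not spaCy's implementation: the table contents, `rows_in_use`, and all function names are made up for illustration. It ranks every row of a padded table, drops the unused rows afterwards, and ends up with result lists of different lengths. The second function shows one possible way around it, restricting the search to rows in use before ranking; whether #5348 takes this approach is not stated in this thread.

```python
import numpy as np

# Toy vectors table (hypothetical data): rows 0-2 hold real vectors,
# rows 3-4 are all-zero padding that no key points to.
table = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
rows_in_use = [0, 1, 2]


def cosine_scores(rows, query):
    norms = np.linalg.norm(rows, axis=1) * np.linalg.norm(query)
    # Zero-norm rows (or queries) score 0 instead of dividing by zero.
    return np.divide(rows @ query, norms, out=np.zeros(len(rows)), where=norms > 0)


def search_all_then_filter(query, n):
    # Rank every row, padding included, then drop rows that are not in use.
    best = np.argsort(-cosine_scores(table, query), kind="stable")[:n]
    return [int(r) for r in best if int(r) in rows_in_use]


def search_rows_in_use(query, n):
    # Rank only the rows in use, then map positions back to table rows.
    best = np.argsort(-cosine_scores(table[rows_in_use], query), kind="stable")[:n]
    return [rows_in_use[int(i)] for i in best]


queries = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]
ragged = [search_all_then_filter(q, n=3) for q in queries]
rectangular = [search_rows_in_use(q, n=3) for q in queries]
```

In this toy setup a padding row makes the top 3 for the second query because one real row has a negative score, so `ragged` holds lists of length 3 and 2, and `np.array(ragged, dtype="int64")` raises the same ValueError as in the report. In the issue the padding rows win for the all-0 "music" query instead, but the filtering step has the same effect.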

@github-actions
Contributor

github-actions bot commented Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2021