`Vectors.most_similar()` raises ValueError when query vectors return different num matches #5320

bdewilde · 2020-04-16T17:25:55Z

How to reproduce the behaviour

When multiple queries passed in a single call to Vectors.most_similar() return different numbers of results (fewer than the specified n for at least some of them), the function fails with a cryptic numpy exception: ValueError: setting an array element with a sequence. Apparently this is raised when you try to create an array from lists of different lengths:

>>> np.array([[1, 2], [1, 2, 3]], dtype="int64")
ValueError                                Traceback (most recent call last)
<ipython-input-86-2d17c17be6b1> in <module>
----> 1 np.array([[1, 2], [1, 2, 3]], dtype="int64")

ValueError: setting an array element with a sequence.

I think these lines are causing it (https://github.com/explosion/spaCy/blob/master/spacy/vectors.pyx#L361-L363), but I can't convince my debugger to dig into the Cython. Here's some further evidence:

(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[0]]), n=10)[0].shape
(1, 8)
(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[1]]), n=10)[0].shape
(1, 10)
(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[2]]), n=10)[0].shape
(1, 9)
(Pdb)  vocab.vectors.most_similar(np.asarray([query_vectors[3]]), n=10)[0].shape
(1, 10)
(Pdb)  vocab.vectors.most_similar(query_vectors, n=10)
*** ValueError: setting an array element with a sequence.
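
Since each single-query call in the session above succeeds, one possible stopgap is to issue the queries one at a time and keep the results as plain lists rather than a single array. This is only a sketch: it assumes that `vectors` behaves as shown in the session (`most_similar(queries, n=...)` returning three arrays with one row per query), and the helper name `most_similar_one_by_one` is made up for illustration.

```python
import numpy as np


def most_similar_one_by_one(vectors, queries, n=10):
    """Query each vector separately so ragged results never get stacked.

    Assumption: `vectors.most_similar(queries, n=...)` returns a tuple
    (keys, best_rows, scores), each with one row per query, as in the
    debugger session above. This helper is hypothetical, not part of spaCy.
    """
    keys, rows, scores = [], [], []
    for query in queries:
        # A batch of exactly one query cannot produce rows of unequal length.
        k, r, s = vectors.most_similar(np.asarray([query]), n=n)
        keys.append(list(k[0]))
        rows.append(list(r[0]))
        scores.append(list(s[0]))
    # Lists of lists: each query may have fewer than n matches.
    return keys, rows, scores
```

Called as `most_similar_one_by_one(vocab.vectors, query_vectors, n=10)`, this trades the batching speed-up for robustness, so it is probably only sensible for small numbers of queries.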

It's possible that this is just a weird edge case, since I'm populating my vocab / vectors table from scratch using a relatively small corpus (1k docs). But maybe this is a realistic issue for the pre-trained vocab/vectors when n is large.

Your Environment

spaCy version: 2.2.4
Platform: Darwin-19.3.0-x86_64-i386-64bit
Python version: 3.7.4

svlandeg · 2020-04-18T11:56:40Z

Thanks for the report and the detailed analysis!

Looks like a bug to me, and something we should definitely investigate further.

Any chance you have a small reproducible code snippet (with a mockup vocab maybe?) that triggers this error? That would help us dig into this faster :-)

bdewilde · 2020-04-18T18:27:30Z

Hi @svlandeg, I came up with a (very haphazard) example that raises this error:

import gensim
import numpy as np
import spacy

lang = "en"
embed_size = 100
texts = [
    "Have you listened to the new Fiona Apple album yet?",
    "I've had it on repeat since yesterday, and wow, it's so so great.",
    "Almost makes the 8-year wait worth it!",
]

spacy_lang = spacy.blank(lang)
docs = spacy_lang.pipe(texts)
sents = [[tok.text for tok in doc] for doc in docs]
# generating custom fasttext word embedding vectors
ft = gensim.models.fasttext.FastText(
    sentences=sents,
    size=embed_size,
    min_count=1,
    window=5,
    iter=5,
)
# reset vectors on vocab object w/ desired embedding size
# see: https://spacy.io/usage/vectors-similarity#custom
spacy_lang.vocab.reset_vectors(width=embed_size)
for word in ft.wv.vocab:
    spacy_lang.vocab.set_vector(word, ft.wv[word])
    
query_vectors = np.asarray([spacy_lang.vocab.get_vector(word) for word in ["music", "album", "I"]])
keys, _, _ = spacy_lang.vocab.vectors.most_similar(query_vectors, n=5)

Thanks for digging in!

adrianeboyd · 2020-04-24T14:01:32Z

Ah, entertaining bugs. Here most_similar is also searching in the empty all-0 padding rows of the internal vectors table (it has some padding so it doesn't have to resize for each new vector, just when it gets full). For "music", which isn't assigned a vector in the model so it gets the default all-0 vector, it returns closest matches out of the all-0 padding rows. These individual matches get filtered out because it knows the rows aren't in use, but then you end up with fewer matches for some queries than others.
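
The mechanism described above can be imitated with plain numpy. The following is a toy sketch, not spaCy's actual implementation: the function `naive_most_similar`, the tiny table, and the tie-breaking rule are all assumptions chosen so that the all-zero query deterministically lands on the padding rows.

```python
import numpy as np


def naive_most_similar(table, rows_in_use, queries, n):
    """Toy nearest-row search over a table that includes unused padding rows."""

    def normalise(m):
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero rows as zeros
        return m / norms

    # Cosine similarity of every query against every row, padding included.
    sims = normalise(queries) @ normalise(table).T
    # Top-n rows per query. Ties go to higher row indices here; that is an
    # artefact of this toy, picked so zero queries hit the padding rows.
    top_rows = np.argsort(sims, axis=1, kind="stable")[:, ::-1][:, :n]
    # Filter out rows that are not in use, which shortens some result lists.
    return [[int(r) for r in rows if r in rows_in_use] for rows in top_rows]


# Three real vectors followed by three all-zero padding rows.
table = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
)
rows_in_use = {0, 1, 2}
# The middle query is all zeros, like the default vector for "music".
queries = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

matches = naive_most_similar(table, rows_in_use, queries, n=3)
print(matches)  # [[0, 2], [], [1, 2]]
```

The result lists have different lengths, so passing `matches` to `np.array(..., dtype="int64")` fails with the same kind of ValueError reported at the top of this issue.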

github-actions · 2021-11-05T00:01:49Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

svlandeg added the feat / vectors Feature: Word vectors and similarity label Apr 18, 2020

svlandeg added the bug Bugs and behaviour differing from documentation label Apr 18, 2020

adrianeboyd mentioned this issue Apr 27, 2020

Fix most_similar for vectors with unused rows #5348

Merged

3 tasks

honnibal closed this as completed in #5348 May 19, 2020

github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2021
