
vector_norm throws an error for unusual text in a sentence with more than one word. #4673

Closed
mmaybeno opened this issue Nov 19, 2019 · 8 comments
Labels: bug (Bugs and behaviour differing from documentation), feat / vectors (Feature: Word vectors and similarity)

mmaybeno (Contributor) commented Nov 19, 2019

How to reproduce the behaviour

I found a bug that I'm not sure resides in spaCy or cupy. It only appears on GPU, and only when you get the vector of a multi-word document that contains non-standard (out-of-vocabulary) words. Any help tracking it down with a potential fix would be fantastic.

import spacy
import en_core_web_md

spacy.prefer_gpu()
nlp = en_core_web_md.load()

doc = nlp("somerandomword")
doc.vector_norm
# works: a single out-of-vocabulary token

doc = nlp("somerandomword.")
doc.vector_norm
# throws TypeError: OOV token followed by an in-vocabulary token

doc = nlp("The somerandomword")
doc.vector_norm
# throws TypeError (traceback below)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-d19c7b8f8943> in <module>()
----> 1 doc.vector_norm
doc.pyx in spacy.tokens.doc.Doc.vector_norm.__get__()
doc.pyx in __iter__()
cupy/core/core.pyx in cupy.core.core.ndarray.__add__()
cupy/core/_kernel.pyx in cupy.core._kernel.ufunc.__call__()
cupy/core/_kernel.pyx in cupy.core._kernel._preprocess_args()
TypeError: Unsupported type <class 'numpy.ndarray'>

Your Environment

  • Operating System: Ubuntu 18.04.3
  • Python Version Used: 3.6.8
  • spaCy Version Used: 2.2.2
  • Environment Information: Running on Google Colab but also experienced it on other GPU instances.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
!pip install spacy==2.2.2
!pip install chainer
!pip install thinc_gpu_ops thinc
!python -m spacy download en_core_web_md 
mmaybeno (Contributor, Author):

For context: I found a cupy PR that was merged recently (cupy/cupy#2611) and thought it would fix this, but building cupy from source didn't appear to resolve the issue.

mmaybeno (Contributor, Author):

I think it has to do with this part specifically, where the token vectors are summed:

sum(t.vector for t in self)

https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L439
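
For anyone following along, the same failure can be reproduced by doing that sum by hand (a sketch using the nlp pipeline loaded in the repro above):

# The same sum spelled out step by step.
doc = nlp("The somerandomword")
token_vectors = [t.vector for t in doc]
total = sum(token_vectors)  # raises the same TypeError on GPU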

mmaybeno (Contributor, Author):

Found it. There is a case where a token's vector is a zero-filled numpy.ndarray instead of a cupy.core.core.ndarray, so you end up with incompatible types when you try to sum them.

[type(t.vector) for t in nlp("somerandomword.")]
# [numpy.ndarray, cupy.core.core.ndarray]
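
The mismatch alone is enough to trigger the error, independent of spaCy. A minimal sketch, assuming a GPU environment with cupy installed:

import numpy as np
import cupy as cp

gpu_vec = cp.zeros((300,), dtype="float32")  # like an in-vocabulary token vector
cpu_vec = np.zeros((300,), dtype="float32")  # like the OOV fallback vector

gpu_vec + cpu_vec  # TypeError: Unsupported type <class 'numpy.ndarray'>

cupy refuses to implicitly mix host (numpy) and device (cupy) arrays in its ufuncs, which is what the _preprocess_args frame in the traceback is complaining about.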

mmaybeno (Contributor, Author):

Vocab's get_vector defaults to a numpy array, so if the word is not in the vocabulary the result stays a zero-filled numpy array even when the rest of the vectors are on the GPU. I think this is the bug. https://github.com/explosion/spaCy/blob/master/spacy/vocab.pyx#L364
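
A rough sketch of the idea behind a fix, not the actual patch: make the OOV fallback use the same array module (numpy or cupy) as the loaded vector table. default_vector here is a hypothetical helper, and it assumes thinc's get_array_module utility from the spaCy 2.x stack:

from thinc.neural.util import get_array_module

def default_vector(vectors_data, size):
    # numpy when the vectors live on the CPU, cupy when they live on the GPU
    xp = get_array_module(vectors_data)
    return xp.zeros((size,), dtype="float32")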

mmaybeno (Contributor, Author):

I'm attempting to create a PR for this fix, but I'm unsure how to test it since it's cupy-related.

adrianeboyd added the labels bug (Bugs and behaviour differing from documentation) and feat / vectors (Feature: Word vectors and similarity) Nov 20, 2019
mmaybeno mentioned this issue Nov 20, 2019
adrianeboyd (Contributor):

Fixed by #4680.

rjurney commented Dec 10, 2019

Awesome!

lock bot commented Jan 9, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jan 9, 2020