
vector_norm throws an error for unusual text in a sentence with more than one word. #4673

Closed
mmaybeno opened this issue Nov 19, 2019 · 8 comments
Labels: bug (Bugs and behaviour differing from documentation), feat / vectors (Feature: Word vectors and similarity)

mmaybeno (Contributor) commented Nov 19, 2019

How to reproduce the behaviour

I found a bug that I'm not sure resides in spaCy or cupy. It only appears on GPU, and only when you get the vector of a multi-word document that contains non-standard (out-of-vocabulary) words. Any help tracking it down with a potential fix would be fantastic.

import spacy
import en_core_web_md

spacy.prefer_gpu()
nlp = en_core_web_md.load()

doc = nlp("somerandomword")
doc.vector_norm
# works: a single out-of-vocabulary token

doc = nlp("somerandomword.")
doc.vector_norm
# throws TypeError: OOV token followed by an in-vocabulary token

doc = nlp("The somerandomword")
doc.vector_norm
# throws TypeError (traceback below)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-d19c7b8f8943> in <module>()
----> 1 doc.vector_norm
doc.pyx in spacy.tokens.doc.Doc.vector_norm.__get__()
doc.pyx in __iter__()
cupy/core/core.pyx in cupy.core.core.ndarray.__add__()
cupy/core/_kernel.pyx in cupy.core._kernel.ufunc.__call__()
cupy/core/_kernel.pyx in cupy.core._kernel._preprocess_args()
TypeError: Unsupported type <class 'numpy.ndarray'>

Your Environment

  • Operating System: Ubuntu 18.04.3
  • Python Version Used: 3.6.8
  • spaCy Version Used: 2.2.2
  • Environment Information: Running on Google Colab but also experienced it on other GPU instances.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
!pip install spacy==2.2.2
!pip install chainer
!pip install thinc_gpu_ops thinc
!python -m spacy download en_core_web_md 
mmaybeno (Contributor, Author):

For context: I found a cupy PR that was merged recently (cupy/cupy#2611) and thought it would fix this, but building cupy from source didn't appear to resolve the issue.

mmaybeno (Contributor, Author):

I think it has to do with this part specifically, where the token vectors are summed:

sum(t.vector for t in self)

https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L439
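
For anyone following along, the same failure can be reproduced by doing that sum by hand (a sketch using the nlp pipeline loaded in the repro above):

# The same sum spelled out step by step.
doc = nlp("The somerandomword")
token_vectors = [t.vector for t in doc]
total = sum(token_vectors)  # raises the same TypeError on GPU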

mmaybeno (Contributor, Author):

Found it. There is a case where a token's vector is a zero-filled numpy.ndarray instead of a cupy.core.core.ndarray, so you end up with incompatible types when you try to sum them.

[type(t.vector) for t in nlp("somerandomword.")]
# [numpy.ndarray, cupy.core.core.ndarray]
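
The mismatch alone is enough to trigger the error, independent of spaCy. A minimal sketch, assuming a GPU environment with cupy installed:

import numpy as np
import cupy as cp

gpu_vec = cp.zeros((300,), dtype="float32")  # like an in-vocabulary token vector
cpu_vec = np.zeros((300,), dtype="float32")  # like the OOV fallback vector

gpu_vec + cpu_vec  # TypeError: Unsupported type <class 'numpy.ndarray'>

cupy refuses to implicitly mix host (numpy) and device (cupy) arrays in its ufuncs, which is what the _preprocess_args frame in the traceback is complaining about.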

mmaybeno (Contributor, Author):

Vocab's get_vector defaults to a numpy array, so if the word is not in the vocabulary the result stays a zero-filled numpy array even when the rest of the vectors are on the GPU. I think this is the bug. https://github.com/explosion/spaCy/blob/master/spacy/vocab.pyx#L364
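
A rough sketch of the idea behind a fix, not the actual patch: make the OOV fallback use the same array module (numpy or cupy) as the loaded vector table. default_vector here is a hypothetical helper, and it assumes thinc's get_array_module utility from the spaCy 2.x stack:

from thinc.neural.util import get_array_module

def default_vector(vectors_data, size):
    # numpy when the vectors live on the CPU, cupy when they live on the GPU
    xp = get_array_module(vectors_data)
    return xp.zeros((size,), dtype="float32")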

mmaybeno (Contributor, Author):

I'm attempting to create a PR for this fix, but I'm unsure how to test it since it's cupy-related.

adrianeboyd added the labels bug (Bugs and behaviour differing from documentation) and feat / vectors (Feature: Word vectors and similarity) Nov 20, 2019
mmaybeno mentioned this issue Nov 20, 2019
adrianeboyd (Contributor):

Fixed by #4680.

rjurney commented Dec 10, 2019

Awesome!

lock bot commented Jan 9, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jan 9, 2020