Update lemma and vector information after splitting a token #4097

svlandeg · 2019-08-08T09:14:14Z

Description

This PR fixes two related bugs that happened after running retokenizer.split:

The lemma information was not updated and still referred to the token.text of the original, unsplit token. This is fixed by resetting the lemma (token.lemma = 0) on the split tokens.
The vector attributes were left unchanged, causing an out-of-bounds error as described in issue #3540 (Problem with vector attribute after calling Retokenizer.split). This is fixed by extending doc.tensor and setting the vectors of the split tokens to an array of zeros (not sure what else to do).
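
To illustrate the second fix, here is a minimal sketch in plain Python. It is not the code from this PR: the helper name split_rows is hypothetical, plain lists stand in for the doc.tensor array, and inserting the zero rows at the position of the split token is an assumption about how rows stay aligned with tokens. It only shows why the tensor needs one row per token after a split, using the same mock data as the regression test.

```python
def split_rows(tensor, index, n_parts):
    """Return a copy of `tensor` in which row `index` is replaced by
    `n_parts` rows of zeros (hypothetical helper, not spaCy API)."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    width = len(tensor[index])
    zero_rows = [[0.0] * width for _ in range(n_parts)]
    # Rows before and after the split token are kept in their original order.
    return tensor[:index] + zero_rows + tensor[index + 1:]


# Mock data mirroring the regression test: one tensor row per token.
words = ["I", "live", "in", "NewYork", "right", "now"]
tensor = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]]

# Splitting "NewYork" (index 3) into "New" and "York" adds one token,
# so the tensor must grow by one row to stay in bounds.
new_words = words[:3] + ["New", "York"] + words[4:]
new_tensor = split_rows(tensor, 3, 2)
```

Without the extra row, looking up the tensor row for the last token would index past the end of the tensor, which is the kind of out-of-bounds error described above.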

Types of change

bug fix

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

ines · 2019-08-08T09:17:17Z

spacy/tests/regression/test_issue3540.py

+def test_issue3540(en_vocab):
+
+ words = ["I", "live", "in", "NewYork", "right", "now"]
+ tensor = np.asarray([[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]], dtype="f")


That's great! 👍 (We should remember this test as a nice example for how to test even pretty complex stuff related to the models without having to actually load or run them.)

ines · 2019-08-08T11:05:42Z

Hmm, looks like the build failure is a problem with Azure Pipelines:

##[Error 1]
The agent: Hosted Agent lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

I'll restart! 🤞

ines · 2019-08-08T12:19:28Z

Hmmm, now this random matcher (?) test keeps failing again randomly on only one configuration... wtf... 🤔

svlandeg · 2019-08-08T12:27:02Z

Where do you see the failing test? I only still see the "Hosted Agent lost communication" error.

ines · 2019-08-08T12:35:53Z

@svlandeg See here for all builds: https://dev.azure.com/explosion-ai/Public/_build?definitionId=8&_a=summary

Restarted the test again and it now passed 🤷‍♀️ (just doesn't seem to be reflected in the PR)

svlandeg · 2019-08-08T12:42:50Z

Weird 😬
I can push a dummy commit and see what happens? On one of the other PRs the other day I had to do the same because the PR wasn't updating itself properly...

honnibal · 2019-08-08T13:01:07Z

Patch looks good, thanks!

ines · 2019-08-08T13:09:34Z

I'll just go ahead and merge this! There really seems to be nothing wrong with this PR – there's likely something wrong somewhere, though, that makes the tests flaky like this.

…n#4097) * fixing vector and lemma attributes after retokenizer.split * fixing unit test with mockup tensor * xp instead of numpy

svlandeg added 3 commits August 8, 2019 10:43

fixing vector and lemma attributes after retokenizer.split

0d167e9

fixing unit test with mockup tensor

c871c4f

xp instead of numpy

6a66212

svlandeg mentioned this pull request Aug 8, 2019

Problem with vector attribute after calling Retokenizer.split #3540

Closed

svlandeg added the bug, feat / doc, and feat / vectors labels Aug 8, 2019

ines reviewed Aug 8, 2019

ines merged commit 963ea5e into explosion:master Aug 8, 2019

svlandeg deleted the bugfix/retokenizer-vectors branch August 8, 2019 13:35

svlandeg mentioned this pull request Aug 8, 2019

Flaky matcher test and ghost IDs #4098

Closed
