Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update lemma and vector information after splitting a token #4097

Merged
merged 3 commits into from
Aug 8, 2019

Conversation

svlandeg
Copy link
Member

@svlandeg svlandeg commented Aug 8, 2019

Description

This PR fixes two related bugs that happened after running retokenizer.split:

  • The lemma information was not updated and was still referring to the old token.text attributes. This is fixed/reset by calling token.lemma = 0 on the split tokens.

  • The vector attributes were left unchanged, causing an out-of-bounds error as described in Issue Problem with vector attribute after calling Retokenizer.split #3540. This is fixed by extending the doc.tensor and setting the vectors of the split tokens to an array of zeros (not sure what else to do).

Types of change

bug fix

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects feat / vectors Feature: Word vectors and similarity labels Aug 8, 2019
def test_issue3540(en_vocab):

words = ["I", "live", "in", "NewYork", "right", "now"]
tensor = np.asarray([[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]], dtype="f")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great! 👍 (We should remember this test as a nice example for how to test even pretty complex stuff related to the models without having to actually load or run them.)

@ines
Copy link
Member

ines commented Aug 8, 2019

Hmm, looks like the build failure is a problem with Azure Pipelines:

##[Error 1]
The agent: Hosted Agent lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

I'll restart! 🤞

@ines
Copy link
Member

ines commented Aug 8, 2019

Hmmm, now this random matcher (?) test keeps failing again randomly on only one configuration... wtf... 🤔

@svlandeg
Copy link
Member Author

svlandeg commented Aug 8, 2019

Where do you see the failing test? I only still see the "Hosted Agent lost communication" error.

@ines
Copy link
Member

ines commented Aug 8, 2019

@svlandeg See here for all builds: https://dev.azure.com/explosion-ai/Public/_build?definitionId=8&_a=summary

Restarted the test again and it now passed 🤷‍♀️ (just doesn't seem to be reflected in the PR)

@svlandeg
Copy link
Member Author

svlandeg commented Aug 8, 2019

Weird 😬
I can push a dummy commit and see what happens? On one of the other PR's the other day I had to do the same because the PR wasn't updating itself properly...

@honnibal
Copy link
Member

honnibal commented Aug 8, 2019

Patch looks good, thanks!

@ines
Copy link
Member

ines commented Aug 8, 2019

I'll just go ahead and merge this! There really seems to be nothing wrong with this PR – there's likely something wrong somewhere, though, that makes the tests flaky like this.

@ines ines merged commit 963ea5e into explosion:master Aug 8, 2019
@svlandeg svlandeg deleted the bugfix/retokenizer-vectors branch August 8, 2019 13:35
polm pushed a commit to polm/spaCy that referenced this pull request Aug 18, 2019
…n#4097)

* fixing vector and lemma attributes after retokenizer.split

* fixing unit test with mockup tensor

* xp instead of numpy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects feat / vectors Feature: Word vectors and similarity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants