Vocab serialization/deserialization leads to incomplete document #4133

Closed
tomnaumann opened this issue Aug 16, 2019 · 3 comments
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects feat / serialize Feature: Serialization, saving and loading


@tomnaumann

It seems that spaCy has a problem round-tripping a document through a byte array: after I serialize a Doc to bytes and load it back, some information, e.g. part-of-speech data, is missing.

I already figured out that it works when I load the document directly with the model's vocab, which suggests that the bug most likely happens during serialization or deserialization of the vocab:
doc = Doc(nlp.vocab).from_bytes(doc_bytes)

How to reproduce the behaviour

import unittest

from spacy import load
from spacy.tokens import Doc
from spacy.vocab import Vocab

class SpacySaveLoadTest(unittest.TestCase):

    def test_foo(self):
        nlp = load('en_core_web_sm')
        vocab_bytes = nlp.vocab.to_bytes()
        doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
        doc_bytes = doc.to_bytes()

        expected = []
        for token in doc:
            expected.append(token.pos_)

        vocab = Vocab()
        vocab.from_bytes(vocab_bytes)
        doc = Doc(vocab).from_bytes(doc_bytes)

        actual = []
        for token in doc:
            actual.append(token.pos_)

        print(actual)
        print(expected)
        self.assertEqual(actual, expected)
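
To see which tokens lose their annotation, rather than only that the two lists differ, a small comparison helper could be used alongside the test above. This is a minimal sketch: `attr_mismatches` is a hypothetical name, not part of spaCy, and it assumes only that each token exposes `.text` and the attribute being compared.

```python
def attr_mismatches(expected_doc, actual_doc, attr="pos_"):
    """Return (index, text, expected, actual) for every token whose `attr` differs.

    Hypothetical helper, not a spaCy API. Works on any two iterables of
    token-like objects that have a `.text` attribute and the named attribute.
    """
    expected_tokens = list(expected_doc)
    actual_tokens = list(actual_doc)
    if len(expected_tokens) != len(actual_tokens):
        raise ValueError("documents have different token counts")

    mismatches = []
    for i, (exp, act) in enumerate(zip(expected_tokens, actual_tokens)):
        exp_value = getattr(exp, attr)
        act_value = getattr(act, attr)
        if exp_value != act_value:
            mismatches.append((i, exp.text, exp_value, act_value))
    return mismatches
```

In the reproduction, this would mean keeping the original document under its own name (the test rebinds `doc`) and asserting that `attr_mismatches(original, restored)` is empty. Passing `attr="tag_"` runs the same check on the fine-grained tags.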

Your Environment

  • spaCy version: 2.1.8
  • Platform: Windows-10-10.0.17134-SP0
  • Python version: 3.7.4
@svlandeg svlandeg added feat / doc Feature: Doc, Span and Token objects feat / serialize Feature: Serialization, saving and loading labels Aug 16, 2019
@svlandeg
Member

Thanks for the report! You're right, this looks like a bug in Vocab.to_bytes(). Will investigate :-)

@svlandeg svlandeg added the bug Bugs and behaviour differing from documentation label Aug 16, 2019
@svlandeg
Member

svlandeg commented Aug 16, 2019

It looks like this is another manifestation of the same bug we recently identified around serialization of pos attributes, cf. issue #3959 and the currently open pull request #4092. I've added a unit test for this particular example; I'm merging it with the other cases and closing this issue.

(FYI @Criffle12: until the PR is merged, perhaps you can use token.tag_ instead of token.pos_, because the tags are serialized correctly.)

@lock

lock bot commented Sep 15, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 15, 2019