Vocab serialization/deserialization leads to incomplete document #4133

tomnaumann · 2019-08-16T15:56:47Z

It seems that spaCy has an issue with converting a document to a byte array and back: after the round trip, some information (e.g. part-of-speech data) is missing.

I already figured out that it works when I load the document directly with the model's vocab, which suggests the bug most likely occurs during serialization or deserialization of the vocab:
doc = Doc(nlp.vocab).from_bytes(doc_bytes)

How to reproduce the behaviour

import unittest

from spacy import load
from spacy.tokens import Doc
from spacy.vocab import Vocab

class SpacySaveLoadTest(unittest.TestCase):

    def test_foo(self):
        # Serialize both the model's vocab and a parsed document.
        nlp = load('en_core_web_sm')
        vocab_bytes = nlp.vocab.to_bytes()
        doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
        doc_bytes = doc.to_bytes()

        # Part-of-speech tags before the round trip.
        expected = []
        for token in doc:
            expected.append(token.pos_)

        # Restore the vocab into a fresh Vocab, then the document on top of it.
        vocab = Vocab()
        vocab.from_bytes(vocab_bytes)
        doc = Doc(vocab).from_bytes(doc_bytes)

        # Part-of-speech tags after the round trip.
        actual = []
        for token in doc:
            actual.append(token.pos_)

        print(actual)
        print(expected)
        self.assertEqual(actual, expected)

Your Environment

spaCy version: 2.1.8
Platform: Windows-10-10.0.17134-SP0
Python version: 3.7.4

svlandeg · 2019-08-16T17:17:00Z

Thanks for the report! You're right, this looks like a bug in Vocab.to_bytes(). Will investigate :-)

svlandeg · 2019-08-16T18:05:30Z

It looks like this is another manifestation of the same bug we identified recently around serialization of pos attributes, cf. issue #3959 and the currently open pull request #4092. I added a unit test for this particular example, so I'm merging it with the other cases and closing this one.

(FYI @Criffle12: until the PR is merged, perhaps you can use token.tag_ instead of token.pos_, because the tags are serialized correctly.)
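
To check which attributes survive the round trip, the original and restored documents can be compared token by token. The following is only a minimal sketch: the helper name `attribute_mismatches` is hypothetical, and the stand-in tokens carry made-up values rather than real spaCy output, so the snippet runs without a model installed.

```python
from types import SimpleNamespace


def attribute_mismatches(original, restored, attr):
    """Return (index, original_value, restored_value) for tokens whose `attr` differs."""
    return [
        (i, getattr(a, attr), getattr(b, attr))
        for i, (a, b) in enumerate(zip(original, restored))
        if getattr(a, attr) != getattr(b, attr)
    ]


# Stand-in tokens with hypothetical values (not real spaCy output): the tags
# survive the round trip, while the coarse part-of-speech values are lost.
original = [
    SimpleNamespace(text="Apple", tag_="NNP", pos_="PROPN"),
    SimpleNamespace(text="is", tag_="VBZ", pos_="VERB"),
]
restored = [
    SimpleNamespace(text="Apple", tag_="NNP", pos_=""),
    SimpleNamespace(text="is", tag_="VBZ", pos_=""),
]

tag_diff = attribute_mismatches(original, restored, "tag_")
pos_diff = attribute_mismatches(original, restored, "pos_")
print(tag_diff)
print(pos_diff)
```

With spaCy, `original` would be the Doc returned by `nlp(...)` and `restored` the Doc rebuilt with `Doc(vocab).from_bytes(doc_bytes)`; since the helper only reads attributes by name, both can be passed in directly. An empty result for `tag_` alongside a non-empty one for `pos_` would match the behaviour described in this thread.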

lock · 2019-09-15T18:42:48Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

svlandeg added the feat / doc (Feature: Doc, Span and Token objects) and feat / serialize (Feature: Serialization, saving and loading) labels Aug 16, 2019

svlandeg added the bug (Bugs and behaviour differing from documentation) label Aug 16, 2019

svlandeg closed this as completed Aug 16, 2019

lock bot locked as resolved and limited conversation to collaborators Sep 15, 2019
