Some PoS tags not restored on deserialization of doc in specific conditions #1773

Closed
csvance opened this issue Dec 27, 2017 · 2 comments
Labels
bug Bugs and behaviour differing from documentation feat / serialize Feature: Serialization, saving and loading


csvance commented Dec 27, 2017

Hi, when I serialize and deserialize a Doc with to_bytes() and from_bytes(), the .pos attribute of some tokens is not restored under specific conditions. I have observed this with the SPACE and PUNCT tags.

import spacy
from spacy.tokens import Doc

sentence = "This is a sentence.\nAnother line. White   \n Space \n\n  Abuse   "

nlp = spacy.load('en')

original_doc = nlp(sentence)
original_len = len(original_doc)
original_bytes = original_doc.to_bytes()

deserialized_doc = Doc(nlp.vocab).from_bytes(original_bytes)
deserialized_len = len(deserialized_doc)
deserialized_bytes = deserialized_doc.to_bytes()

assert original_bytes == deserialized_bytes
assert original_len == deserialized_len
for idx in range(0, original_len):
    print("Original(%d): %s New(%d): %s" % (
        original_doc[idx].pos, original_doc[idx].pos_,
        deserialized_doc[idx].pos, deserialized_doc[idx].pos_))
    try:
        assert original_doc[idx].pos == deserialized_doc[idx].pos
    except AssertionError:
        print("Assertation failed.")

Original(89): DET New(89): DET
Original(99): VERB New(99): VERB
Original(89): DET New(89): DET
Original(91): NOUN New(91): NOUN
Original(96): PUNCT New(96): PUNCT
Original(102): SPACE New(0):
Assertation failed.
Original(89): DET New(89): DET
Original(91): NOUN New(91): NOUN
Original(96): PUNCT New(96): PUNCT
Original(95): PROPN New(95): PROPN
Original(102): SPACE New(102): SPACE
Original(95): PROPN New(95): PROPN
Original(102): SPACE New(102): SPACE
Original(95): PROPN New(95): PROPN
Original(102): SPACE New(0):
Assertation failed.

Here's an example with punctuation:

sentence = "Blah — Blahh — Blahhh"

Original(95): PROPN New(95): PROPN
Original(96): PUNCT New(0):
Assertation failed.
Original(95): PROPN New(95): PROPN
Original(96): PUNCT New(0):
Assertation failed.
Original(95): PROPN New(95): PROPN

However, sentence = "Blah —— Blahh —— Blahhh" works just fine.
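
The per-token loop in the report can be condensed into a small helper that collects only the positions where the tag changed. This is a minimal sketch, not part of spaCy: the name `pos_mismatches` is hypothetical, and the tokens below are plain stand-in objects that mirror the punctuation output above, so the snippet runs without a model installed.

```python
from types import SimpleNamespace


def pos_mismatches(original, restored):
    """Return (index, original_pos, restored_pos) for every token whose
    coarse-grained tag differs between the two sequences."""
    if len(original) != len(restored):
        raise ValueError("documents differ in length")
    return [
        (i, a.pos_, b.pos_)
        for i, (a, b) in enumerate(zip(original, restored))
        if a.pos_ != b.pos_
    ]


# Stand-in tokens (not real spaCy objects) mirroring the punctuation example:
# the PUNCT tags come back empty after the round trip.
before = [SimpleNamespace(pos_=p) for p in ["PROPN", "PUNCT", "PROPN", "PUNCT", "PROPN"]]
after = [SimpleNamespace(pos_=p) for p in ["PROPN", "", "PROPN", "", "PROPN"]]

print(pos_mismatches(before, after))  # [(1, 'PUNCT', ''), (3, 'PUNCT', '')]
```

Because the helper only relies on `len()`, iteration, and a `pos_` attribute, it should also accept the `original_doc` and `deserialized_doc` objects from the script above; an empty list would mean every tag survived the round trip.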

Your Environment

Info about spaCy

  • I have tried with both the 2.0.5 release and building from master.
  • spaCy version: 2.0.0rc2 (master ff9fc94)
  • Platform: Linux-4.10.0-40-generic-x86_64-with-debian-stretch-sid
  • Python version: 3.6.3
  • Models: en
  • Operating System: Ubuntu 16.04.3 LTS
honnibal added a commit that referenced this issue Dec 30, 2018
@honnibal
Member

Sorry for the delay getting to this, and thanks for the nice report. Fixed now.


lock bot commented Jan 29, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 29, 2019