"Copying" a Doc with to_array/from_array does not yield the same SENT_START #4440

Wirg · 2019-10-14T21:06:39Z

Context

I am playing around building new Docs in a pipeline to remove sentences (https://stackoverflow.com/questions/58368208/filtering-out-sentences-between-sentence-segmentation-and-other-spacy-components?noredirect=1#comment103103941_58368208).
I noticed that I could not make SENT_START work with from_array. I expect to have the same sentences that what I had previously and end up with one sentence by token.

How to reproduce the behaviour

import spacy
from spacy.tokens import Doc
from spacy import attrs
import numpy as np

ATTRIBUTES_TO_RESTORE = [attrs.ORTH, attrs.LEMMA, attrs.SHAPE, attrs.IS_STOP, attrs.DEP, attrs.LOWER, attrs.POS, attrs.TAG, attrs.IS_ALPHA, attrs.SENT_START]

def copy_doc(doc: Doc) -> Doc:
    return Doc(
        doc.vocab,
        words=list(map(str, doc)),
        spaces=list(map(lambda token: token.whitespace_ != '', doc))
    ).from_array(ATTRIBUTES_TO_RESTORE, doc.to_array(ATTRIBUTES_TO_RESTORE))


nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a short sentence. Too short. This is a really really long sentence with a lot of junk.")
doc_copy = copy_doc(doc)
for attr in ATTRIBUTES_TO_RESTORE:
    print(attrs.NAMES[attr])
    # Expect no errors, raise for attrs.SENT_START
    np.testing.assert_array_equal(doc.to_array([attr]), doc_copy.to_array([attr]))

Your Environment

spaCy version: 2.2.0
Platform: Linux-4.4.0-18362-Microsoft-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.7

EDIT:
After some fiddling, it seems like I should not be using attrs.SENT_START but attrs.HEAD in my case as I use dep.
It seems like I should probably have an error message :

spaCy/spacy/tokens/doc.pyx

Lines 785 to 786 in f2d2247

 if SENT_START in attrs and HEAD in attrs: 

 raise ValueError(Errors.E032)

So :

ATTRIBUTES_TO_RESTORE = [attrs.SENT_START] works if I don't have HEAD and/or DEP
ATTRIBUTES_TO_RESTORE = [attrs.HEAD, attrs.DEP] works if I don't have SENT_START

I guess that the previous snippet should be replaced by something like :

        if SENT_START in attrs and (HEAD in attrs or DEP in attrs):
            raise ValueError(Errors.E032)

Should I do a PR ?

The text was updated successfully, but these errors were encountered:

svlandeg · 2019-10-15T08:10:54Z

Hi @Wirg, thanks for the report and the detailed analysis! I think you're right: your original case should have thrown an error because SENT_START shouldn't be used if the document is parsed. It would be good to make this function more robust and the error message more transparent. If you create the PR - feel free to include your original case as a regression test in https:/explosion/spaCy/tree/master/spacy/tests/regression

adrianeboyd · 2019-10-18T08:46:43Z

If you haven't seen it already, check out Span.as_doc() which shows how to handle a lot of the related issues (SENT_START vs. HEAD/DEP, adjusting dependency arcs, etc.).

spaCy/spacy/tokens/span.pyx

Lines 203 to 242 in 258eb9e

 def as_doc(self, bint copy_user_data=False): 

 """Create a `Doc` object with a copy of the `Span`'s data. 

  copy_user_data (bool): Whether or not to copy the original doc's user data. 

  RETURNS (Doc): The `Doc` copy of the span. 

  DOCS: https://spacy.io/api/span#as_doc 

  """ 

 # TODO: make copy_user_data a keyword-only argument (Python 3 only) 

 words = [t.text for t in self] 

 spaces = [bool(t.whitespace_) for t in self] 

 cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces) 

 array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_KB_ID] 

 if self.doc.is_tagged: 

 array_head.append(TAG) 

 # If doc parsed add head and dep attribute 

 if self.doc.is_parsed: 

 array_head.extend([HEAD, DEP]) 

 # Otherwise add sent_start 

 else: 

 array_head.append(SENT_START) 

 array = self.doc.to_array(array_head) 

 array = array[self.start : self.end] 

 self._fix_dep_copy(array_head, array) 

 doc.from_array(array_head, array) 

 doc.noun_chunks_iterator = self.doc.noun_chunks_iterator 

 doc.user_hooks = self.doc.user_hooks 

 doc.user_span_hooks = self.doc.user_span_hooks 

 doc.user_token_hooks = self.doc.user_token_hooks 

 doc.vector = self.vector 

 doc.vector_norm = self.vector_norm 

 doc.tensor = self.doc.tensor[self.start : self.end] 

 for key, value in self.doc.cats.items(): 

 if hasattr(key, "__len__") and len(key) == 3: 

 cat_start, cat_end, cat_label = key 

 if cat_start == self.start_char and cat_end == self.end_char: 

 doc.cats[cat_label] = value 

 if copy_user_data: 

 doc.user_data = self.doc.user_data 

 return doc

adrianeboyd · 2020-02-13T08:20:03Z

Just coming back to this...

I think that the original check (for HEAD) is correct because DEP alone doesn't provide any sentence boundaries, so the conflict is only between SENT_START and HEAD.

The problem is that DEP alone sets self.is_parsed and then set_children_from_heads() sees lots of 0 heads and modifies all the sentence boundaries. I think is_parsed should just be determined from HEAD and not DEP.

lock · 2020-03-17T16:37:16Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

svlandeg added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects labels Oct 15, 2019

adrianeboyd mentioned this issue Feb 13, 2020

Require HEAD for is_parsed in Doc.from_array() #5011

Merged

3 tasks

honnibal closed this as completed in #5011 Feb 16, 2020

lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Copying" a Doc with to_array/from_array does not yield the same SENT_START #4440

"Copying" a Doc with to_array/from_array does not yield the same SENT_START #4440

Wirg commented Oct 14, 2019 •

edited

Loading

svlandeg commented Oct 15, 2019 •

edited

Loading

adrianeboyd commented Oct 18, 2019

adrianeboyd commented Feb 13, 2020

lock bot commented Mar 17, 2020

"Copying" a Doc with to_array/from_array does not yield the same SENT_START #4440

"Copying" a Doc with to_array/from_array does not yield the same SENT_START #4440

Comments

Wirg commented Oct 14, 2019 • edited Loading

Context

How to reproduce the behaviour

Your Environment

svlandeg commented Oct 15, 2019 • edited Loading

adrianeboyd commented Oct 18, 2019

adrianeboyd commented Feb 13, 2020

lock bot commented Mar 17, 2020

Wirg commented Oct 14, 2019 •

edited

Loading

svlandeg commented Oct 15, 2019 •

edited

Loading