Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Copying" a Doc with to_array/from_array does not yield the same SENT_START #4440

Closed
Wirg opened this issue Oct 14, 2019 · 4 comments · Fixed by #5011
Closed

"Copying" a Doc with to_array/from_array does not yield the same SENT_START #4440

Wirg opened this issue Oct 14, 2019 · 4 comments · Fixed by #5011
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects

Comments

@Wirg
Copy link

Wirg commented Oct 14, 2019

Context

I am playing around building new Docs in a pipeline to remove sentences (https://stackoverflow.com/questions/58368208/filtering-out-sentences-between-sentence-segmentation-and-other-spacy-components?noredirect=1#comment103103941_58368208).
I noticed that I could not make SENT_START work with from_array. I expect to have the same sentences that what I had previously and end up with one sentence by token.

How to reproduce the behaviour

import spacy
from spacy.tokens import Doc
from spacy import attrs
import numpy as np

ATTRIBUTES_TO_RESTORE = [attrs.ORTH, attrs.LEMMA, attrs.SHAPE, attrs.IS_STOP, attrs.DEP, attrs.LOWER, attrs.POS, attrs.TAG, attrs.IS_ALPHA, attrs.SENT_START]

def copy_doc(doc: Doc) -> Doc:
    return Doc(
        doc.vocab,
        words=list(map(str, doc)),
        spaces=list(map(lambda token: token.whitespace_ != '', doc))
    ).from_array(ATTRIBUTES_TO_RESTORE, doc.to_array(ATTRIBUTES_TO_RESTORE))


nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a short sentence. Too short. This is a really really long sentence with a lot of junk.")
doc_copy = copy_doc(doc)
for attr in ATTRIBUTES_TO_RESTORE:
    print(attrs.NAMES[attr])
    # Expect no errors, raise for attrs.SENT_START
    np.testing.assert_array_equal(doc.to_array([attr]), doc_copy.to_array([attr]))

Your Environment

  • spaCy version: 2.2.0
  • Platform: Linux-4.4.0-18362-Microsoft-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.7

EDIT:
After some fiddling, it seems like I should not be using attrs.SENT_START but attrs.HEAD in my case as I use dep.
It seems like I should probably have an error message :

spaCy/spacy/tokens/doc.pyx

Lines 785 to 786 in f2d2247

if SENT_START in attrs and HEAD in attrs:
raise ValueError(Errors.E032)

So :

  • ATTRIBUTES_TO_RESTORE = [attrs.SENT_START] works if I don't have HEAD and/or DEP
  • ATTRIBUTES_TO_RESTORE = [attrs.HEAD, attrs.DEP] works if I don't have SENT_START

I guess that the previous snippet should be replaced by something like :

        if SENT_START in attrs and (HEAD in attrs or DEP in attrs):
            raise ValueError(Errors.E032)

Should I do a PR ?

@svlandeg svlandeg added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects labels Oct 15, 2019
@svlandeg
Copy link
Member

svlandeg commented Oct 15, 2019

Hi @Wirg, thanks for the report and the detailed analysis! I think you're right: your original case should have thrown an error because SENT_START shouldn't be used if the document is parsed. It would be good to make this function more robust and the error message more transparent. If you create the PR - feel free to include your original case as a regression test in https:/explosion/spaCy/tree/master/spacy/tests/regression

@adrianeboyd
Copy link
Contributor

If you haven't seen it already, check out Span.as_doc() which shows how to handle a lot of the related issues (SENT_START vs. HEAD/DEP, adjusting dependency arcs, etc.).

spaCy/spacy/tokens/span.pyx

Lines 203 to 242 in 258eb9e

def as_doc(self, bint copy_user_data=False):
"""Create a `Doc` object with a copy of the `Span`'s data.
copy_user_data (bool): Whether or not to copy the original doc's user data.
RETURNS (Doc): The `Doc` copy of the span.
DOCS: https://spacy.io/api/span#as_doc
"""
# TODO: make copy_user_data a keyword-only argument (Python 3 only)
words = [t.text for t in self]
spaces = [bool(t.whitespace_) for t in self]
cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_KB_ID]
if self.doc.is_tagged:
array_head.append(TAG)
# If doc parsed add head and dep attribute
if self.doc.is_parsed:
array_head.extend([HEAD, DEP])
# Otherwise add sent_start
else:
array_head.append(SENT_START)
array = self.doc.to_array(array_head)
array = array[self.start : self.end]
self._fix_dep_copy(array_head, array)
doc.from_array(array_head, array)
doc.noun_chunks_iterator = self.doc.noun_chunks_iterator
doc.user_hooks = self.doc.user_hooks
doc.user_span_hooks = self.doc.user_span_hooks
doc.user_token_hooks = self.doc.user_token_hooks
doc.vector = self.vector
doc.vector_norm = self.vector_norm
doc.tensor = self.doc.tensor[self.start : self.end]
for key, value in self.doc.cats.items():
if hasattr(key, "__len__") and len(key) == 3:
cat_start, cat_end, cat_label = key
if cat_start == self.start_char and cat_end == self.end_char:
doc.cats[cat_label] = value
if copy_user_data:
doc.user_data = self.doc.user_data
return doc

@adrianeboyd
Copy link
Contributor

Just coming back to this...

I think that the original check (for HEAD) is correct because DEP alone doesn't provide any sentence boundaries, so the conflict is only between SENT_START and HEAD.

The problem is that DEP alone sets self.is_parsed and then set_children_from_heads() sees lots of 0 heads and modifies all the sentence boundaries. I think is_parsed should just be determined from HEAD and not DEP.

@lock
Copy link

lock bot commented Mar 17, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants