Different sentence spans on Document and Token level #5435

Closed
tobiasblasberg opened this issue May 14, 2020 · 7 comments · Fixed by #5439
Labels
bug (Bugs and behaviour differing from documentation) · feat / doc (Feature: Doc, Span and Token objects) · models (Issues related to the statistical models)

Comments

@tobiasblasberg

How to reproduce the behaviour

I would like to extract the sentence index of a token in a doc. The current workaround uses token.sent and compares that span with the list of sentences from doc.sents.

Issue: in some cases, token.sent returns a different sentence span than the corresponding sentence from doc.sents:

import spacy
nlp = spacy.load("en_core_web_md")
text = "Very satisfied!. This product definitely met my expectations. I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light and I have a case on it now anyway), brand new screen with screen protector, and works like new. I have had no problems with it at all. I ordered it and I was scheduled to receive it a week later, but it was in my mailbox four days early. I am extremely satisfied with this product as well as this company. I will probably buy another electronic device from Laptop Angels because they are very trustworthy and honest. If you are looking to buy just an iPhone for a cheaper price than what is in the store, I would tell you to buy it from Laptop Angels. Thank you so much for your honest business. I am a very satisfied customer! :)"

doc = nlp(text)
sentences = [sent for sent in doc.sents]
token = doc[14] #refurbished
token.sent == sentences[2] # False

sentences[2] # I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light
token.sent # I ordered a refurbished iPhone 4s and it was exactly like it was described:
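
For reference, the workaround I mentioned looks roughly like this (just an illustrative helper, not a spaCy API):

def sentence_index(token):
    # Illustrative sketch of the workaround described above: compare
    # token.sent against each sentence span from doc.sents.
    for i, sent in enumerate(token.doc.sents):
        if token.sent == sent:
            return i
    return None

sentence_index(doc[14])  # returns None here, because token.sent matches none of the doc.sents spans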

Your Environment

  • Python Version Used: 3.8.2
  • spaCy Version Used: 2.2.4
  • Environment Information: Docker image python:3 (Linux)
@svlandeg
Member

Thanks for the report. What is weird, though, is that I can't replicate this. With the same model and the same spaCy version, both sentences[2] and token.sent are "I ordered a refurbished iPhone 4s and it was exactly like it was described:" on my end.

Could you paste the results of running this:

for s in sentences:
    print("****", s, "****")

Could you also provide the output of running python -m spacy validate in the console?

@svlandeg added the feat / doc (Feature: Doc, Span and Token objects) and more-info-needed (This issue needs more information) labels on May 14, 2020
@tobiasblasberg
Author

Thanks for the quick reply.

Here is the required info:

for s in sentences:
    print("****", s, "****")

**** Very satisfied!. ****
**** This product definitely met my expectations. ****
**** I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light ****
**** and I have a case on it now anyway), ****
**** brand new screen with screen protector, and works like new. ****
**** I have had no problems with it at all. ****
**** I ordered it ****
**** and I was scheduled to receive it a week later, but it was in my mailbox four days early. ****
**** I am extremely satisfied with this product as well as this company. ****
**** I will probably buy another electronic device from Laptop Angels because they are very trustworthy and honest. ****
**** If you are looking to buy just an iPhone for a cheaper price than what is in the store, I would tell you to buy it from Laptop Angels. ****
**** Thank you so much for your honest business. ****
**** I am a very satisfied customer! :) ****

and token.sent results in:
I ordered a refurbished iPhone 4s and it was exactly like it was described:

For further tests I have now added the _sm and _lg models to the Docker container, but originally only the _md model was installed and used. This is the output of python -m spacy validate:

✔ Loaded compatibility table

====================== Installed models (spaCy v2.2.4) ======================
ℹ spaCy installation: /usr/local/lib/python3.8/site-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.5   ✔
package   en-core-web-md   en_core_web_md   2.2.5   ✔
package   en-core-web-lg   en_core_web_lg   2.2.5   ✔

The no-response bot removed the more-info-needed (This issue needs more information) label on May 14, 2020
@tobiasblasberg
Author

Here is another strange behavior of token.sent, where the token is not even part of the span returned by token.sent:

for token in doc:
    print(token.sent, token) 

Part of the output:

...
I ordered a refurbished iPhone 4s and it was exactly like it was described: I
I ordered a refurbished iPhone 4s and it was exactly like it was described: ordered
I ordered a refurbished iPhone 4s and it was exactly like it was described: a
I ordered a refurbished iPhone 4s and it was exactly like it was described: refurbished
I ordered a refurbished iPhone 4s and it was exactly like it was described: iPhone
I ordered a refurbished iPhone 4s and it was exactly like it was described: 4s
I ordered a refurbished iPhone 4s and it was exactly like it was described: and
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: exactly
I ordered a refurbished iPhone 4s and it was exactly like it was described: like
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: described
I ordered a refurbished iPhone 4s and it was exactly like it was described: :
I ordered a refurbished iPhone 4s and it was exactly like it was described: minor
I ordered a refurbished iPhone 4s and it was exactly like it was described: scratches
I ordered a refurbished iPhone 4s and it was exactly like it was described: on
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: back
I ordered a refurbished iPhone 4s and it was exactly like it was described: (
I ordered a refurbished iPhone 4s and it was exactly like it was described: you
I ordered a refurbished iPhone 4s and it was exactly like it was described: can
I ordered a refurbished iPhone 4s and it was exactly like it was described: not
I ordered a refurbished iPhone 4s and it was exactly like it was described: see
I ordered a refurbished iPhone 4s and it was exactly like it was described: them
I ordered a refurbished iPhone 4s and it was exactly like it was described: unless
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: has
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: right
I ordered a refurbished iPhone 4s and it was exactly like it was described: kind
I ordered a refurbished iPhone 4s and it was exactly like it was described: of
I ordered a refurbished iPhone 4s and it was exactly like it was described: light

...

@svlandeg
Member

Ok, so you are using en_core_web_md 2.2.5 and I still had en_core_web_md 2.2.0, which explains the difference. The good news is that I can now replicate this with en_core_web_md 2.2.5.
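
For anyone else hitting this, a quick sanity check (small sketch, just reading the model's meta) to see which model version is actually being loaded:

import spacy

nlp = spacy.load("en_core_web_md")
# nlp.meta holds the loaded package's metadata, including the exact version,
# which helps when several model versions are installed side by side.
print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])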

The other good news is that you can probably revert your model back to 2.2.0, and all this weirdness should go away.

The bad news is, this definitely looks like a bug :(
We'll look into it!

@svlandeg added the bug (Bugs and behaviour differing from documentation) and models (Issues related to the statistical models) labels on May 14, 2020
@tobiasblasberg
Author

Thank you for the solution! Reverting the model to v2.2.0 avoids this bug.

A little off topic, but related to the same example: we are still struggling with the results of the sentence splitter in general, which in this example splits a sentence in the middle of a bracketed clause:
**** minor scratches on the back (you can not see them unless it has the right kind of light ****
**** and I have a case on it now anyway), ****

Is there a sentence splitter you would recommend in this case?

@adrianeboyd
Contributor

adrianeboyd commented May 14, 2020

I haven't used it, but I know of this rule-based SBD tool for English, which I think does more counting of matching paired punctuation: https://spacy.io/universe/project/python-sentence-boundary-disambiguation
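
Alternatively, if you don't need the parser's boundaries at all, a minimal sketch (assuming the v2.2 pipeline API) would be to disable the parser and let the built-in rule-based Sentencizer set the sentence starts, since it only splits after sentence-final punctuation:

import spacy

# Sketch: rule-based sentence boundaries instead of parser-based ones.
nlp = spacy.load("en_core_web_md", disable=["parser"])
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

doc = nlp("Very satisfied!. This product definitely met my expectations.")
for sent in doc.sents:
    print(sent.text)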

The different model versions produce different parses, which leads to the differing behavior. (So at least it's not a bug related to the model itself!)

If a parse is available, Span.sent tries to analyze the parse to find sentence boundaries instead of using Token.sent_start, which is what Doc.sents uses. I think there's a bug in how it's finding the root or the parse boundaries, since the parser is what has set sent_start in the first place (from the exact same parses).

Since doc.sents just uses sent_start, I think Span.sent could be modified to also just use sent_start, but it would be nice to understand what the bug is in the parse analysis. I suspect it's related to some of the l_edge/r_edge problems with non-projective dependency trees when you start modifying the parse on-the-fly, too.
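
To illustrate the difference, here is a sketch (not the actual Span.sent implementation) of deriving a token's sentence purely from sent_start, the same information Doc.sents iterates over:

def sent_from_sent_start(token):
    # Sketch only: walk outwards from the token using the sent_start flags
    # the parser has already set, mirroring what Doc.sents does.
    doc = token.doc
    start = token.i
    while start > 0 and not doc[start].is_sent_start:
        start -= 1
    end = token.i + 1
    while end < len(doc) and not doc[end].is_sent_start:
        end += 1
    return doc[start:end]

By construction this agrees with doc.sents, so for the example above sent_from_sent_start(doc[14]) would give the full sentence rather than the truncated span from Token.sent.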

@github-actions
Contributor

github-actions bot commented Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The github-actions bot locked this issue as resolved and limited the conversation to collaborators on Nov 5, 2021