Different sentence spans on Document and Token level #5435

Closed
tobiasblasberg opened this issue May 14, 2020 · 7 comments · Fixed by #5439
Labels
bug (Bugs and behaviour differing from documentation) · feat / doc (Feature: Doc, Span and Token objects) · models (Issues related to the statistical models)

Comments

@tobiasblasberg

How to reproduce the behaviour

I would like to extract the sentence index of a token in a doc. The current workaround uses token.sent and compares that span with the list of sentences from doc.sents.

Issue: in some cases, token.sent returns a different sentence span than the corresponding sentence from doc.sents:

import spacy
nlp = spacy.load("en_core_web_md")
text = "Very satisfied!. This product definitely met my expectations. I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light and I have a case on it now anyway), brand new screen with screen protector, and works like new. I have had no problems with it at all. I ordered it and I was scheduled to receive it a week later, but it was in my mailbox four days early. I am extremely satisfied with this product as well as this company. I will probably buy another electronic device from Laptop Angels because they are very trustworthy and honest. If you are looking to buy just an iPhone for a cheaper price than what is in the store, I would tell you to buy it from Laptop Angels. Thank you so much for your honest business. I am a very satisfied customer! :)"

doc = nlp(text)
sentences = [sent for sent in doc.sents]
token = doc[14] #refurbished
token.sent == sentences[2] # False

sentences[2] # I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light
token.sent # I ordered a refurbished iPhone 4s and it was exactly like it was described:
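
For reference, the workaround I mentioned looks roughly like this (just an illustrative helper, not a spaCy API):

def sentence_index(token):
    # Illustrative sketch of the workaround described above: compare
    # token.sent against each sentence span from doc.sents.
    for i, sent in enumerate(token.doc.sents):
        if token.sent == sent:
            return i
    return None

sentence_index(doc[14])  # returns None here, because token.sent matches none of the doc.sents spans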

Your Environment

  • Python Version Used: 3.8.2
  • spaCy Version Used: 2.2.4
  • Environment Information: Docker image python:3 (Linux)
@svlandeg
Member

Thanks for the report. What is weird, though, is that I can't replicate this. With the same model and the same spaCy version, both sentences[2] and token.sent are "I ordered a refurbished iPhone 4s and it was exactly like it was described:" on my end.

Could you paste the results of running this:

for s in sentences:
    print("****", s, "****")

Could you also provide the output of running python -m spacy validate in the console?

@svlandeg added the feat / doc (Feature: Doc, Span and Token objects) and more-info-needed (This issue needs more information) labels on May 14, 2020
@tobiasblasberg
Author

Thanks for the quick reply.

Here is the required info:

for s in sentences:
    print("****", s, "****")

**** Very satisfied!. ****
**** This product definitely met my expectations. ****
**** I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light ****
**** and I have a case on it now anyway), ****
**** brand new screen with screen protector, and works like new. ****
**** I have had no problems with it at all. ****
**** I ordered it ****
**** and I was scheduled to receive it a week later, but it was in my mailbox four days early. ****
**** I am extremely satisfied with this product as well as this company. ****
**** I will probably buy another electronic device from Laptop Angels because they are very trustworthy and honest. ****
**** If you are looking to buy just an iPhone for a cheaper price than what is in the store, I would tell you to buy it from Laptop Angels. ****
**** Thank you so much for your honest business. ****
**** I am a very satisfied customer! :) ****

and token.sent results in:
I ordered a refurbished iPhone 4s and it was exactly like it was described:

For further tests I have now added the _sm and _lg models to the Docker container, but originally only the _md model was installed and used. This is the output of python -m spacy validate:

✔ Loaded compatibility table

====================== Installed models (spaCy v2.2.4) ======================
ℹ spaCy installation: /usr/local/lib/python3.8/site-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.5   ✔
package   en-core-web-md   en_core_web_md   2.2.5   ✔
package   en-core-web-lg   en_core_web_lg   2.2.5   ✔

The no-response bot removed the more-info-needed (This issue needs more information) label on May 14, 2020
@tobiasblasberg
Author

Here is another strange behavior of token.sent, where the token is not even part of the span returned by token.sent:

for token in doc:
    print(token.sent, token) 

Part of the output:

...
I ordered a refurbished iPhone 4s and it was exactly like it was described: I
I ordered a refurbished iPhone 4s and it was exactly like it was described: ordered
I ordered a refurbished iPhone 4s and it was exactly like it was described: a
I ordered a refurbished iPhone 4s and it was exactly like it was described: refurbished
I ordered a refurbished iPhone 4s and it was exactly like it was described: iPhone
I ordered a refurbished iPhone 4s and it was exactly like it was described: 4s
I ordered a refurbished iPhone 4s and it was exactly like it was described: and
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: exactly
I ordered a refurbished iPhone 4s and it was exactly like it was described: like
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: described
I ordered a refurbished iPhone 4s and it was exactly like it was described: :
I ordered a refurbished iPhone 4s and it was exactly like it was described: minor
I ordered a refurbished iPhone 4s and it was exactly like it was described: scratches
I ordered a refurbished iPhone 4s and it was exactly like it was described: on
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: back
I ordered a refurbished iPhone 4s and it was exactly like it was described: (
I ordered a refurbished iPhone 4s and it was exactly like it was described: you
I ordered a refurbished iPhone 4s and it was exactly like it was described: can
I ordered a refurbished iPhone 4s and it was exactly like it was described: not
I ordered a refurbished iPhone 4s and it was exactly like it was described: see
I ordered a refurbished iPhone 4s and it was exactly like it was described: them
I ordered a refurbished iPhone 4s and it was exactly like it was described: unless
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: has
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: right
I ordered a refurbished iPhone 4s and it was exactly like it was described: kind
I ordered a refurbished iPhone 4s and it was exactly like it was described: of
I ordered a refurbished iPhone 4s and it was exactly like it was described: light

...

@svlandeg
Member

Ok, so you are using en_core_web_md 2.2.5 and I still had en_core_web_md 2.2.0, which explains the difference. The good news is that I can now replicate this with en_core_web_md 2.2.5.
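
For anyone else hitting this, a quick sanity check (small sketch, just reading the model's meta) to see which model version is actually being loaded:

import spacy

nlp = spacy.load("en_core_web_md")
# nlp.meta holds the loaded package's metadata, including the exact version,
# which helps when several model versions are installed side by side.
print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])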

The other good news is that you can probably revert your model back to 2.2.0, and all this weirdness should go away.

The bad news is, this definitely looks like a bug :(
We'll look into it!

@svlandeg added the bug (Bugs and behaviour differing from documentation) and models (Issues related to the statistical models) labels on May 14, 2020
@tobiasblasberg
Author

Thank you for the solution! Reverting the model to v2.2.0 avoids this bug.

A little off topic, but related to the same example: we are still struggling with the results of the sentence splitter in general, which in this example splits a sentence in the middle of a bracketed clause:
**** minor scratches on the back (you can not see them unless it has the right kind of light ****
**** and I have a case on it now anyway), ****

Is there a sentence splitter you would recommend in this case?

@adrianeboyd
Contributor

adrianeboyd commented May 14, 2020

I haven't used it, but I know of this rule-based SBD tool for English, which I think does more counting of matching paired punctuation: https://spacy.io/universe/project/python-sentence-boundary-disambiguation
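
Alternatively, if you don't need the parser's boundaries at all, a minimal sketch (assuming the v2.2 pipeline API) would be to disable the parser and let the built-in rule-based Sentencizer set the sentence starts, since it only splits after sentence-final punctuation:

import spacy

# Sketch: rule-based sentence boundaries instead of parser-based ones.
nlp = spacy.load("en_core_web_md", disable=["parser"])
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

doc = nlp("Very satisfied!. This product definitely met my expectations.")
for sent in doc.sents:
    print(sent.text)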

The different model versions produce different parses, which leads to the differing behavior. (So at least it's not a bug related to the model itself!)

If a parse is available, Span.sent tries to analyze the parse to find sentence boundaries instead of using Token.sent_start, which is what Doc.sents uses. I think there's a bug in how it's finding the root or the parse boundaries, since the parser is what has set sent_start in the first place (from the exact same parses).

Since doc.sents just uses sent_start, I think Span.sent could be modified to also just use sent_start, but it would be nice to understand what the bug is in the parse analysis. I suspect it's related to some of the l_edge/r_edge problems with non-projective dependency trees when you start modifying the parse on-the-fly, too.
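
To illustrate the difference, here is a sketch (not the actual Span.sent implementation) of deriving a token's sentence purely from sent_start, the same information Doc.sents iterates over:

def sent_from_sent_start(token):
    # Sketch only: walk outwards from the token using the sent_start flags
    # the parser has already set, mirroring what Doc.sents does.
    doc = token.doc
    start = token.i
    while start > 0 and not doc[start].is_sent_start:
        start -= 1
    end = token.i + 1
    while end < len(doc) and not doc[end].is_sent_start:
        end += 1
    return doc[start:end]

By construction this agrees with doc.sents, so for the example above sent_from_sent_start(doc[14]) would give the full sentence rather than the truncated span from Token.sent.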

@github-actions
Contributor

github-actions bot commented Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The github-actions bot locked this issue as resolved and limited the conversation to collaborators on Nov 5, 2021