Different sentence spans on Document and Token level #5435
Comments
Thanks for the report. What is weird, though, is that I can't replicate this. With the same model and the same spaCy version, both Could you paste the results of running this:
Could you also provide the output of running
Thanks for the quick reply. Here is the required info:
**** Very satisfied!. **** and
For further tests, I now added _sm and _lg models to the docker container but originally, just the _md model was loaded and used. This is the output of
✔ Loaded compatibility table
====================== Installed models (spaCy v2.2.4) ======================
TYPE NAME MODEL VERSION
Here is another strange behavior of token.sent, where the token is part of the span of token.sent:
Part of the output: ...
Ok, so you are using The other good news is, you can probably revert your model back to 2.2.0 and all this weirdness should go away. The bad news is, this definitely looks like a bug :(
Thank you for the solution! V2.2.0 works fine for this bug. A little off topic, but related to the same example: we are still struggling with the results of the sentence splitter in general, which in this example seems to split sentences in the middle of brackets: Is there a sentence splitter that you would recommend in this case?
I haven't used it, but I know of this rule-based SBD tool for English, which I think does more counting of matching paired punctuation: https://spacy.io/universe/project/python-sentence-boundary-disambiguation The different model versions produce different parses, which leads to the differing behavior. (So at least it's not a bug related to model itself!) If a parse is available, Since
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
I would like to extract the sentence index of a token in a doc. My current workaround takes token.sent and compares that span against the list of sentences from doc.sents.
Issue: in some cases, token.sent returns a different sentence span than the corresponding sentence from doc.sents:
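For illustration only (this is not the code from the original report): a minimal sketch of one way to look up a token's sentence index without going through token.sent. It assumes the boundaries from doc.sents are the ones to trust, so the list of sentence start offsets could be built as `[sent.start for sent in doc.sents]` and the token position taken from `token.i`. The helper name `sentence_index` and the sample offsets are hypothetical, and the sketch uses plain integers so it runs without a model.

```python
from bisect import bisect_right


def sentence_index(sent_starts, token_i):
    """Return the index of the sentence that contains token position token_i.

    sent_starts: sorted token offsets at which each sentence begins
    (with spaCy, presumably [sent.start for sent in doc.sents]).
    token_i: position of the token in the document (token.i in spaCy).
    """
    if not sent_starts or token_i < sent_starts[0]:
        raise ValueError("token position precedes the first sentence")
    # bisect_right counts how many sentence starts are <= token_i;
    # the last of those is the sentence the token belongs to.
    return bisect_right(sent_starts, token_i) - 1


# Hypothetical boundaries: three sentences starting at tokens 0, 5 and 9.
starts = [0, 5, 9]
indices = [sentence_index(starts, i) for i in (0, 4, 5, 8, 9, 12)]
print(indices)  # [0, 0, 1, 1, 2, 2]
```

Because the lookup only relies on the start offsets, it does not depend on token.sent and doc.sents agreeing with each other.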
Your Environment