Incorrect tokenization of dash punctuation in Spanish #3277

Closed
BrianSladek opened this issue Feb 14, 2019 · 2 comments
Labels
feat / tokenizer Feature: Tokenizer lang / es Spanish language data and models perf / accuracy Performance: accuracy


In Spanish text, the conventions for using dashes and em-dashes as punctuation seem to be considerably different from those in English. spaCy often does not tokenize the dash or em-dash as a separate token, instead keeping it attached to the closest word.

For example, the Spanish sentence:
—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez.
English Translation:
"My name is...", murmured the boy, "Emilio Sanchez Perez."

Here, the Spanish dash is used like a comma, and the em-dash at the beginning of the sentence is used like a double quote. I believe the tokenizer is thrown off because there is no space between the dash and the adjacent word.
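To make the two characters easier to tell apart, here is a small sketch using only Python's standard `unicodedata` module. It lists each distinct dash-like character in the example sentence with its code point and Unicode name. The variable names are illustrative and not part of spaCy.

```python
import unicodedata

sentence = "—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez."

# Collect every distinct character whose Unicode general category is
# "Pd" (Punctuation, dash), keyed by the character itself.
dashes = {}
for ch in sentence:
    if unicodedata.category(ch) == "Pd":
        dashes[ch] = (f"U+{ord(ch):04X}", unicodedata.name(ch))

for ch, (codepoint, name) in dashes.items():
    print(repr(ch), codepoint, name)
```

The sentence mixes an em dash (U+2014) at the start with en dashes (U+2013) around the interjection, so any tokenizer rule would need to cover both characters.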

The Spanish sentence above is tokenized as:
—Yo
me
llamo
...
–murmuró
el
niño–
Emilio
Sánchez
Pérez
.

I would expect the tokenization to be:
—
Yo
me
llamo
...
–
murmuró
el
niño
–
Emilio
Sánchez
Pérez
.

Your Environment

  • spaCy version: 2.0.12
  • Platform: Darwin-18.0.0-x86_64-i386-64bit
  • Python version: 3.7.0
  • Models: de, es, en
@ines ines added lang / es Spanish language data and models feat / tokenizer Feature: Tokenizer perf / accuracy Performance: accuracy labels Feb 14, 2019

ines commented Feb 14, 2019

Thanks for the report and explanation – I didn't actually know this about Spanish!

I just opened a PR in #3281 that adds the dashes to the prefixes and suffixes. The change didn't break anything else, so I guess it seems fine 🤷‍♀️
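For readers unfamiliar with prefix and suffix rules, the following is a toy illustration in plain Python. It is not spaCy's tokenizer and not the code from the PR. It assumes a simplified model in which text is split on whitespace and known punctuation is then peeled off each end of a chunk. The function name `tokenize` and the `BASE_*` rule sets are hypothetical.

```python
def tokenize(text, prefixes, suffixes):
    """Split on whitespace, then peel known prefixes/suffixes off each chunk."""
    # Try longer patterns first so "..." wins over ".".
    prefixes = sorted(prefixes, key=len, reverse=True)
    suffixes = sorted(suffixes, key=len, reverse=True)
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        while chunk:
            match = next((p for p in prefixes if chunk.startswith(p)), None)
            if match is None:
                break
            leading.append(match)
            chunk = chunk[len(match):]
        while chunk:
            match = next((s for s in suffixes if chunk.endswith(s)), None)
            if match is None:
                break
            trailing.append(match)
            chunk = chunk[:-len(match)]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        # Suffixes were collected from the outside in, so restore text order.
        tokens.extend(reversed(trailing))
    return tokens


# Hypothetical rule sets for this sketch only.
BASE_PREFIXES = ["¿", "¡", '"']
BASE_SUFFIXES = ["...", ".", ",", "?", "!", '"']
DASHES = ["\u2013", "\u2014"]  # en dash, em dash

sentence = "—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez."
before = tokenize(sentence, BASE_PREFIXES, BASE_SUFFIXES)
after = tokenize(sentence, BASE_PREFIXES + DASHES, BASE_SUFFIXES + DASHES)
print(before)
print(after)
```

With the base rules the dashes stay attached to `Yo`, `murmuró`, and `niño`, as in the report. Once both dash characters are in the prefix and suffix sets, each dash comes out as its own token.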


lock bot commented Mar 17, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2019