Implement Character Repetition Correction [resolves #268] #277

Open: ertugrul-dmr wants to merge 6 commits into base: develop


ertugrul-dmr
Contributor

  • Added an optional character repetition normalizer for tokenizers and text2doc, similar to the mention, hashtag, and emoji options.
  • The normalizer works as @onatyap proposed in issue #268 (Character Repetition Correction).
  • During tokenization, it takes strings like "Süppeeeer", "Berbaaat", "Muhteşeemmmm" and returns "Süper", "Berbat", "Muhteşem".
  • Added test cases.
  • I tested it on several prebuilt models and here are the results:
| Prebuilt Model | Original Result | Preprocessed Result |
| --- | --- | --- |
| Tweet Sentiment Classification | 3-Fold F-1: 0.8640, 5-Fold F-1: 0.8669 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8640 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.8242 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.696, Accuracy: 0.691 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.852 |
  • This method may be useful when the data contains many words with repeated characters.
  • Apart from the test results, @onatyap, could you check the code and confirm that it matches what you proposed in the issue?
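
For readers skimming the thread, the behavior described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `collapse_repeats` is a hypothetical name, and the sketch assumes the simplest possible rule, collapsing every run of the same character to a single occurrence. The actual normalizer may follow the proposal in #268 more closely.

```python
import re

# Hypothetical helper for illustration only; not the PR's implementation.
# Assumption: every run of an identical character is collapsed to one.
_REPEATED_CHAR = re.compile(r"(.)\1+")


def collapse_repeats(word: str) -> str:
    """Replace each run of a repeated character with a single copy."""
    return _REPEATED_CHAR.sub(r"\1", word)


for noisy in ["Süppeeeer", "Berbaaat", "Muhteşeemmmm"]:
    print(noisy, "->", collapse_repeats(noisy))
```

Note that a rule this naive would also shorten legitimately doubled letters (for example, "dikkat" would become "dikat"), which may be one reason the scores in the table move only slightly in either direction.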

ertugrul-dmr added the enhancement (New feature or request) label on Jun 4, 2021
ertugrul-dmr requested a review from onatyap on Jun 4, 2021, 13:11
onatyap linked an issue on Jun 10, 2021 that may be closed by this pull request