Implement Character Repetition Correction [resolves #268] #277

ertugrul-dmr · 2021-06-04T13:11:26Z

Added an optional character repetition normalizer for the tokenizers and text2doc, similar to the existing mention, hashtag, and emoji options.
The normalizer works the way @onatyap proposed in issue Character Repetition Correction #268.
During tokenization, it takes strings like "Süppeeeer", "Berbaaat", "Muhteşeemmmm" and returns "Süper", "Berbat", "Muhteşem".
Added test cases.
I tested it on several prebuilt models and here are the results:

Prebuilt Model	Original Result	Preprocessed Result
Tweet Sentiment Classification	3-Fold F-1: 0.8640, 5-Fold F-1: 0.8669	3-Fold F-1: 0.8587, 5-Fold F-1: 0.8640
Movie Review Sentiment Classification	F-1: 0.8258	F-1: 0.8242
Telco Tweet Sentiment Classification	F-1: 0.6871, Accuracy: 0.6925	F-1: 0.696, Accuracy: 0.691
Turkish Customer Reviews Classification	F-1: 0.851	F-1: 0.852

This method might be useful when the data contains many words with repeated characters.
Apart from the test results, @onatyap, could you check the code and confirm that it does what you proposed in the issue?
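
For readers skimming the thread, the normalization described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `collapse_repetitions` is hypothetical, and the sketch assumes the simplest rule consistent with the three examples, namely collapsing every run of a repeated character to a single character.

```python
import re

# Matches any character followed by one or more copies of itself.
# Assumption: every run is collapsed to one character; the PR's actual
# rule may differ.
_REPEATED_CHAR = re.compile(r"(.)\1+", flags=re.DOTALL)


def collapse_repetitions(token: str) -> str:
    """Collapse each run of a repeated character to a single occurrence."""
    return _REPEATED_CHAR.sub(r"\1", token)


for word in ["Süppeeeer", "Berbaaat", "Muhteşeemmmm"]:
    print(word, "->", collapse_repetitions(word))
# Süppeeeer -> Süper
# Berbaaat -> Berbat
# Muhteşeemmmm -> Muhteşem
```

Under this simple rule, words with legitimately doubled letters are also shortened (for example, "dikkat" becomes "dikat"), which is one reason to keep the normalizer optional.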

ertugrul-dmr added 6 commits May 31, 2021 12:51

test

62f4bce

added repetition correct

620c036

removed redundant brackets

a24ecdd

added initial test cases

0860062

fixed test cases

3f16acd

fixed namings

195d7c2

ertugrul-dmr added the enhancement New feature or request label Jun 4, 2021

ertugrul-dmr requested a review from onatyap June 4, 2021 13:11

ertugrul-dmr assigned onatyap and ertugrul-dmr Jun 4, 2021

onatyap linked an issue Jun 10, 2021 that may be closed by this pull request

Character Repetition Correction #268

Open

irmakyucel mentioned this pull request Jul 7, 2021

Spelling Correction Test Results #283

Open
