Amount of augmentation should be sampled #228
I guess you are referring to KeywordAug or RandomCharAug. Both augmenters provide aug_char_p and aug_word_p, and the two work independently. aug_word_p controls how many words are drawn from a sentence. After that, aug_char_p controls how many characters are drawn from each selected word. If you use the Sometimes pipeline (one of the Flow classes), aug_p is the percentage of sub-pipelines to execute.
Example: if aug_p is 0.3 in Sometimes with 4 sub-pipelines, only 1 pipeline (0.3*4 = 1.2, rounded down to 1) will be executed. The selected pipeline differs between executions. I agree that rounding down may not be a good approach and will change it to round up.
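The rounding arithmetic described above can be sketched in plain Python. This is only an illustration of the described behavior, not nlpaug's actual code; `count_selected` and its `round_up` flag are hypothetical names introduced here.

```python
import math

def count_selected(aug_p: float, n_items: int, round_up: bool = False) -> int:
    """Hypothetical helper: how many of n_items a fraction aug_p selects."""
    raw = aug_p * n_items
    # Current behavior as described in the thread: round down.
    # Proposed change: round up.
    return math.ceil(raw) if round_up else math.floor(raw)

# Four sub-pipelines with aug_p = 0.3, as in the example: 0.3 * 4 = 1.2
print(count_selected(0.3, 4))                 # rounded down
print(count_selected(0.3, 4, round_up=True))  # rounded up

# A small fraction on a short input rounds down to nothing at all.
print(count_selected(0.05, 10))
```

With rounding down, any input shorter than 1/aug_p items selects zero items, which is the case the issue is concerned with.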
I think most would expect aug_word_p and aug_p to be independent samples for words and their characters. Instead, these parameters specify the fraction of words and characters to augment, rounded down. This seems to lead to some odd behavior, such as not being able to ensure that short and long texts are similarly distorted.

Concrete example: we may want to simulate realistic spelling mistakes. For that, we probably want aug_word_p*aug_p to be at most a few %. To get that, we have to set aug_char_min=0 or aug_word_min=0, or not use the flow helpers. However, doing either of the former means that short sentences or words will never be augmented, as they are always rounded to 0.

What do you think about changing these values to be independent samples (while respecting the minimums)?
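A minimal sketch of what independent sampling with a minimum could look like, assuming each word or character is selected by its own coin flip. The function `sample_independently` and its parameters are hypothetical and are not part of nlpaug.

```python
import random

def sample_independently(n_items, p, minimum=0, rng=None):
    """Hypothetical helper: select each index independently with probability p,
    then top up at random until at least `minimum` indices are selected."""
    rng = rng or random.Random()
    chosen = [i for i in range(n_items) if rng.random() < p]
    # Respect the minimum, capped at the number of available items.
    needed = min(minimum, n_items) - len(chosen)
    if needed > 0:
        remaining = [i for i in range(n_items) if i not in chosen]
        chosen.extend(rng.sample(remaining, needed))
    return sorted(chosen)

# A 5-character word with p = 0.05: a fixed fraction rounded down gives 0
# characters every time, while independent draws occasionally select one.
rng = random.Random(0)
word = "hello"
hits = sum(bool(sample_independently(len(word), 0.05, rng=rng)) for _ in range(10_000))
print(f"short word augmented in {hits} of 10000 runs")
```

Under this scheme the expected number of selected items is p times the length, so short and long texts are distorted at the same rate on average, while the minimum still guarantees a floor when one is wanted.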