Amount of augmentation should be sampled #228

Closed
cwenner opened this issue Jul 4, 2021 · 1 comment
Labels
enhancement New feature or request

Comments


cwenner commented Jul 4, 2021

I think most users would expect aug_word_p and aug_p to be independent per-word and per-character sampling probabilities. Instead, these parameters specify the fraction of words and characters to augment, rounded down. This leads to some odd behavior: for example, there is no way to ensure that short and long texts are distorted to a similar degree.

Concrete example: we may want to simulate realistic spelling mistakes. For that, we probably want aug_word_p*aug_p to be at most a few percent. To get that, we have to set aug_char_min=0 or aug_word_min=0, or not use the flow helpers. However, either of the first two options means that short sentences or words will never be augmented, because their counts always round down to 0.

What do you think about changing these values to be independent samples (while respecting the minimums)?
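
A minimal sketch of what this proposal could look like, written in plain Python rather than against the library's code. The helper name `sample_indices` and its parameters are hypothetical and exist only for illustration: each item is selected by an independent draw with probability p, and a minimum is then enforced by topping up from the unselected items.

```python
import random

def sample_indices(n_items, p, minimum=0, rng=None):
    """Select indices by independent draws with probability p each,
    then top up at random until at least `minimum` are selected.
    Hypothetical helper, not part of any library."""
    rng = rng or random.Random()
    chosen = [i for i in range(n_items) if rng.random() < p]
    minimum = min(minimum, n_items)  # cannot select more items than exist
    if len(chosen) < minimum:
        remaining = [i for i in range(n_items) if i not in chosen]
        chosen += rng.sample(remaining, minimum - len(chosen))
    return sorted(chosen)

# Example: pick words to augment in a short sentence.
words = ["I", "eat", "apple"]
picked = sample_indices(len(words), p=0.3, minimum=0, rng=random.Random(0))
print([words[i] for i in picked])
```

With this scheme the expected number of selected items is p times the length for short and long inputs alike (when the minimum is 0), so a short text is sometimes augmented instead of never.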

cwenner changed the title from "Amount of augmentation should be random" to "Amount of augmentation should be sampled" on Jul 4, 2021
makcedward added the enhancement (New feature or request) label on Jul 15, 2021
@makcedward
Owner

I guess you are referring to KeywordAug or RandomCharAug.

Both augmenters provide aug_char_p and aug_word_p, and they work independently. aug_word_p controls how many words will be drawn from a sentence. After that, aug_char_p controls how many characters will be drawn from each selected word.
Example---------
Input: "I eat apple."
Parameters: aug_word_p = 0.3, aug_char_p = 0.5
One of the words will be drawn. Let's assume "apple" is picked.
Within "apple", 2 characters are drawn (0.5 * 5 = 2.5, rounded down to 2), so it can become "appkW".
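
The arithmetic in this example can be reproduced with a small sketch. `floor_count` is a hypothetical helper, and the default minimum of 1 is an assumption made to match "one of the words will be drawn"; it is not taken from the library source.

```python
import math

def floor_count(n_items, p, minimum=1):
    """Number of items to augment: floor(p * n_items), but at least
    `minimum` and never more than n_items. Hypothetical helper."""
    return min(n_items, max(minimum, math.floor(p * n_items)))

words = ["I", "eat", "apple"]              # "I eat apple." treated as 3 words
n_words = floor_count(len(words), 0.3)     # floor(0.9) = 0, raised to the minimum of 1
n_chars = floor_count(len("apple"), 0.5)   # floor(2.5) = 2
print(n_words, n_chars)                    # prints: 1 2
```

Setting `minimum=0` in the first call gives 0 words, which is the short-text case described in the issue.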

If you use the Sometimes pipeline (one of the Flow classes), aug_p refers to the fraction of sub-pipelines that are executed.

Example---------
Input: [KeywordAug, RandomCharAug, RandomWordAug, RandomWordAug]

naf.Sometimes(
    [KeywordAug, RandomCharAug, RandomWordAug, RandomWordAug]
)

If aug_p is 0.3 in Sometimes, only 1 sub-pipeline will be executed (0.3 * 4 = 1.2, rounded down to 1). The selected sub-pipeline can differ from one execution to the next.

Agreed that rounding down may not be a good approach. I will change it to round up.
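
To make the difference between the two rounding rules concrete, here is a hedged comparison in plain Python. Both functions are hypothetical and apply no minimum or maximum, so only the effect of rounding is visible; the probability 0.05 is just an illustrative value for "a few percent".

```python
import math

def count_round_down(n_items, p):
    """Item count under the round-down rule (hypothetical helper)."""
    return math.floor(p * n_items)

def count_round_up(n_items, p):
    """Item count under the round-up rule (hypothetical helper)."""
    return math.ceil(p * n_items)

p = 0.05
for n_items in (3, 10, 100):
    print(n_items, count_round_down(n_items, p), count_round_up(n_items, p))
```

For a 3-item input, rounding down gives 0 items and rounding up gives 1 item, which is a third of the input. Rounding up therefore guarantees that short inputs are augmented, but at a rate well above p, whereas per-item sampling would match p on average.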
