Potential future tokenization issue in Alpaca #366
This is an interesting catch. I'll need to dig a bit deeper into the example you provided, but my first thought is that the pre-processing for the dataset (and prompt template) should handle this instead of the tokenizer.
@RdoubleA assigning to you to take a deeper look as part of the data utilities RFC.
This should be solved by #624, since we now split the tokenization of the prompt and response into separate `tokenizer.encode` calls.
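A sketch of the separate-encode pattern that comment describes (the stub tokenizer and `encode` signature below are assumptions for illustration, not the actual #624 diff):

```python
# A stand-in tokenizer so the sketch runs on its own; torchtune's real
# tokenizer and the exact encode() signature in #624 may differ.
class StubTokenizer:
    def encode(self, text):
        return text.split()  # one "token" per whitespace-separated piece

CROSS_ENTROPY_IGNORE_IDX = -100  # conventional ignore index for loss masking

def tokenize_separately(tokenizer, prompt, response):
    """Encode prompt and response independently and concatenate.

    Because the prompt is never re-encoded together with the response,
    the label mask boundary is exact by construction.
    """
    prompt_tokens = tokenizer.encode(prompt)
    response_tokens = tokenizer.encode(response)
    input_ids = prompt_tokens + response_tokens
    labels = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens) + response_tokens
    return input_ids, labels

input_ids, labels = tokenize_separately(StubTokenizer(), "a b", "c d")
print(input_ids)  # ['a', 'b', 'c', 'd']
print(labels)     # [-100, -100, 'c', 'd']
```

With this structure there is no longer a prefix-consistency assumption between two separate encodings of overlapping text.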
In the Alpaca dataset, the prompt and input text are encoded in the following way:
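A minimal sketch of the pattern being described (the tokenizer here is a whitespace stand-in and the template text is illustrative; neither is the actual dataset code):

```python
class StubTokenizer:
    # stand-in with an assumed encode() API, not torchtune's tokenizer
    def encode(self, text):
        return text.split()

tokenizer = StubTokenizer()
prompt = "### Instruction:\nWrite a haiku.\n\n### Response:\n"
response = "An old silent pond."

# The prompt is encoded once on its own, and once as a prefix of the
# full prompt + response string:
encoded_prompt = tokenizer.encode(prompt)
encoded_prompt_with_response = tokenizer.encode(prompt + response)

# Labels for the prompt portion are then masked out using
# len(encoded_prompt), which silently assumes encoded_prompt is an
# exact prefix of encoded_prompt_with_response.
```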
This may cause a subtle issue: `sentencepiece` tokenization is context-dependent, so the tokens generated for the prompt (i.e. `encoded_prompt`) may be different from the tokens generated for the same section of input in `encoded_prompt_with_response`. If this happens, the masking may be ever so slightly off (it may also affect generation in weird ways if encoding is done slightly differently).

Now, this doesn't arise for the default prompt templates, since they use newline delimiters (these always get encoded to `13` in the default llama2 tokenizer), but it would be a problem if someone were to copy this as the basis for their own 😄

E.g. if someone wanted to use a custom prompt template like the following, then the tokenization could vary (FWIW I've run into this issue personally when using sentencepiece and trying to train only on inputs).
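As a concrete (if contrived) illustration, a toy greedy longest-match tokenizer over a tiny vocabulary shows how the tokens for a prompt can stop being a prefix of the tokens for prompt + response. This is a simplification of the merging behavior, not sentencepiece itself:

```python
# Toy vocabulary, sorted longest-first so greedy matching prefers merges.
VOCAB = ["abc", "ab", "a", "b", "c"]

def encode(text):
    """Greedy longest-match tokenization: at each position, consume the
    longest vocabulary piece that matches. This mimics how subword
    tokenizers can merge across a prompt/response boundary."""
    tokens = []
    i = 0
    while i < len(text):
        for piece in VOCAB:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

prompt, response = "ab", "c"
prompt_tokens = encode(prompt)                      # ['ab']
combined_tokens = encode(prompt + response)         # ['abc']

# The prompt's tokens are NOT a prefix of the combined encoding, so a
# mask computed from len(prompt_tokens) would be misaligned:
print(prompt_tokens, combined_tokens)
```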
You can see this directly by looking at the difference when round-tripping the tokenization:
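A hypothetical helper (the `encode` method is an assumed interface, and the merging tokenizer below is a stand-in, not sentencepiece) that makes the same check programmatic; whenever it returns `False`, masking by the length of the prompt's tokens will be misaligned:

```python
def prompt_is_token_prefix(tokenizer, prompt, response):
    """Return True iff encode(prompt) is an exact prefix of
    encode(prompt + response)."""
    prompt_tokens = tokenizer.encode(prompt)
    combined_tokens = tokenizer.encode(prompt + response)
    return combined_tokens[:len(prompt_tokens)] == prompt_tokens

class MergingTokenizer:
    # stand-in that merges "ab" + "c" into a single piece, mimicking a
    # context-dependent subword tokenizer; assumed API, not sentencepiece
    def encode(self, text):
        return ["abc"] if text == "abc" else list(text)

print(prompt_is_token_prefix(MergingTokenizer(), "ab", "c"))  # False
```

Running the same check over a dataset with a real tokenizer would flag any prompt/response pairs whose boundaries get merged.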
So overall this is not something impacting anything today; but since this is one of the first datasets and may get copied by users as an example (or eventually generalized into something more flexible, or used with a custom tokenizer model), it may cause problems in the future that will be hard to track down and debug.