
Potential future tokenization issue in Alpaca #366

Closed
laurencer opened this issue Feb 10, 2024 · 3 comments

@laurencer

In the Alpaca dataset, the prompt and input text are encoded in the following way:

        # tokenize the prompt alone, and the full prompt + response
        encoded_prompt = self._tokenizer.encode(
            text=prompt, add_bos=True, add_eos=False
        )
        encoded_prompt_with_response = self._tokenizer.encode(
            text=prompt_with_response, add_bos=True, add_eos=True
        )
        labels = encoded_prompt_with_response.copy()

        # if not training on the input, mask out the prompt portion of the labels
        if not self.train_on_input:
            labels[: len(encoded_prompt)] = [CROSS_ENTROPY_IGNORE_IDX] * len(
                encoded_prompt
            )

This can cause a subtle issue: SentencePiece tokenization is context-dependent, so the tokens generated for the prompt alone (i.e. encoded_prompt) may differ from the tokens generated for the same span of text within encoded_prompt_with_response. When that happens, the masking is ever so slightly off (it may also affect generation in unexpected ways if encoding is done slightly differently).
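
To make the failure mode concrete, here is a minimal check one could add before masking (a sketch, assuming a SentencePiece-backed tokenizer with the same encode signature as above; tokenizer, prompt, and response are stand-ins for the dataset's actual variables):

    # Hypothetical guard, not part of the Alpaca dataset code: verify that the
    # prompt tokens are a strict prefix of the prompt + response tokens.
    encoded_prompt = tokenizer.encode(text=prompt, add_bos=True, add_eos=False)
    encoded_prompt_with_response = tokenizer.encode(
        text=prompt + response, add_bos=True, add_eos=True
    )

    if encoded_prompt != encoded_prompt_with_response[: len(encoded_prompt)]:
        # The boundary tokens merged differently, so masking the first
        # len(encoded_prompt) positions would mask too much or too little.
        raise ValueError("prompt tokens are not a prefix of prompt + response tokens")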

Now this doesn't arise for the default prompt templates, since they use newline delimiters (which always encode to token 13 in the default llama2 tokenizer), but it would be a problem if someone were to copy this as the basis for their own 😄

E.g. if someone wanted to use a custom prompt template like the following, the tokenization could vary (FWIW I've run into this issue personally when using SentencePiece and trying to train only on inputs).

# assume the provided instruction is "test" and the rest is the prompt template

>>> tokenizer.encode("<INSTRUCTION>test<RESPONSE>", add_eos=False)
[1, 529, 1177, 10810, 29965, 9838, 29958, 1688, 29966, 1525, 5550, 1164, 1660, 29958]
# assume the actual response begins with "<html>"

>>> tokenizer.encode("<INSTRUCTION>test<RESPONSE><html>", add_eos=False)
[1, 529, 1177, 10810, 29965, 9838, 29958, 1688, 29966, 1525, 5550, 1164, 1660, 5299, 1420, 29958]

You can see the difference directly by round-tripping the tokenization:

# last token when just prompt encoded
>>> tokenizer.decode([29958])
'>'

# last tokens when prompt + response encoded
>>> tokenizer.decode([5299, 1420, 29958])
'><html>'

# last token of the prompt when prompt + response encoded
>>> tokenizer.decode([5299])
'><'

So overall this isn't something with an impact today, but since this is one of the first datasets and may get copied by users as an example (or eventually generalized into something more flexible, or used with a custom tokenizer model), it could cause problems in the future that would be hard to track down and debug.

@kartikayk
Contributor

This is an interesting catch. I'll need to dig a bit deeper into the example you provided, but my first thought would be that the pre-processing for the dataset (and prompt template) should handle this instead of the tokenizer. Either <html> should get special handling or it should be stripped out completely.
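
As a purely illustrative sketch of that idea (a hypothetical helper, not an actual torchtune API): ensure the prompt template ends with a delimiter that tokenizes stably, like the newline case mentioned above, so the boundary cannot merge with the start of the response.

    # Hypothetical pre-processing sketch: force the prompt to end with a newline,
    # which the default llama2 tokenizer always encodes to the same token, so the
    # prompt/response boundary cannot merge into a different token.
    def build_prompt(template: str, instruction: str) -> str:
        prompt = template.format(instruction=instruction)
        if not prompt.endswith("\n"):
            prompt += "\n"
        return prompt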

@kartikayk
Contributor

@RdoubleA assigning to you to take a deeper look as part of the data utilities RFC.

@ebsmothers
Contributor

This should be solved by #624, since we now split the tokenization of the prompt and the response into separate tokenizer.encode calls.
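
For anyone landing here later, a rough sketch of that approach (assuming the same add_bos/add_eos flags as in the snippets above; not the exact implementation in #624):

    # Sketch: encode prompt and response separately and concatenate, so the mask
    # boundary is exact by construction and no token can merge across it.
    prompt_tokens = tokenizer.encode(text=prompt, add_bos=True, add_eos=False)
    response_tokens = tokenizer.encode(text=response, add_bos=False, add_eos=True)

    tokens = prompt_tokens + response_tokens
    labels = list(tokens)
    if not train_on_input:
        labels[: len(prompt_tokens)] = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens)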
