
Potential future tokenization issue in Alpaca #366

Closed
laurencer opened this issue Feb 10, 2024 · 3 comments

@laurencer

In the Alpaca dataset, the prompt and input text are encoded in the following way:

        # tokenize the prompt alone, and the full prompt + response
        encoded_prompt = self._tokenizer.encode(
            text=prompt, add_bos=True, add_eos=False
        )
        encoded_prompt_with_response = self._tokenizer.encode(
            text=prompt_with_response, add_bos=True, add_eos=True
        )
        labels = encoded_prompt_with_response.copy()

        # if not training on the input, mask out the prompt portion of the labels
        if not self.train_on_input:
            labels[: len(encoded_prompt)] = [CROSS_ENTROPY_IGNORE_IDX] * len(
                encoded_prompt
            )

This can cause a subtle issue: SentencePiece tokenization is context-dependent, so the tokens generated for the prompt alone (i.e. encoded_prompt) may differ from the tokens generated for the same span of text within encoded_prompt_with_response. When that happens, the masking is ever so slightly off (it may also affect generation in unexpected ways if encoding is done slightly differently).
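
To make the failure mode concrete, here is a minimal check one could add before masking (a sketch, assuming a SentencePiece-backed tokenizer with the same encode signature as above; tokenizer, prompt, and response are stand-ins for the dataset's actual variables):

    # Hypothetical guard, not part of the Alpaca dataset code: verify that the
    # prompt tokens are a strict prefix of the prompt + response tokens.
    encoded_prompt = tokenizer.encode(text=prompt, add_bos=True, add_eos=False)
    encoded_prompt_with_response = tokenizer.encode(
        text=prompt + response, add_bos=True, add_eos=True
    )

    if encoded_prompt != encoded_prompt_with_response[: len(encoded_prompt)]:
        # The boundary tokens merged differently, so masking the first
        # len(encoded_prompt) positions would mask too much or too little.
        raise ValueError("prompt tokens are not a prefix of prompt + response tokens")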

Now this doesn't arise for the default prompt templates, since they use newline delimiters (which always encode to token 13 in the default llama2 tokenizer), but it would be a problem if someone were to copy this as the basis for their own 😄

E.g. if someone wanted to use a custom prompt template like the following, the tokenization could vary (FWIW I've run into this issue personally when using SentencePiece and trying to train only on inputs).

# assume the provided instruction is "test" and the rest is the prompt template

>>> tokenizer.encode("<INSTRUCTION>test<RESPONSE>", add_eos=False)
[1, 529, 1177, 10810, 29965, 9838, 29958, 1688, 29966, 1525, 5550, 1164, 1660, 29958]
# assume the actual response begins with "<html>"

>>> tokenizer.encode("<INSTRUCTION>test<RESPONSE><html>", add_eos=False)
[1, 529, 1177, 10810, 29965, 9838, 29958, 1688, 29966, 1525, 5550, 1164, 1660, 5299, 1420, 29958]

You can see the difference directly by round-tripping the tokenization:

# last token when just prompt encoded
>>> tokenizer.decode([29958])
'>'

# last tokens when prompt + response encoded
>>> tokenizer.decode([5299, 1420, 29958])
'><html>'

# last token of the prompt when prompt + response encoded
>>> tokenizer.decode([5299])
'><'

So overall this isn't something with an impact today, but since this is one of the first datasets and may get copied by users as an example (or eventually generalized into something more flexible, or used with a custom tokenizer model), it could cause problems in the future that would be hard to track down and debug.

@kartikayk
Contributor

This is an interesting catch. I'll need to dig a bit deeper into the example you provided, but my first thought would be that the pre-processing for the dataset (and prompt template) should handle this instead of the tokenizer. Either <html> should get special handling or it should be stripped out completely.
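
As a purely illustrative sketch of that idea (a hypothetical helper, not an actual torchtune API): ensure the prompt template ends with a delimiter that tokenizes stably, like the newline case mentioned above, so the boundary cannot merge with the start of the response.

    # Hypothetical pre-processing sketch: force the prompt to end with a newline,
    # which the default llama2 tokenizer always encodes to the same token, so the
    # prompt/response boundary cannot merge into a different token.
    def build_prompt(template: str, instruction: str) -> str:
        prompt = template.format(instruction=instruction)
        if not prompt.endswith("\n"):
            prompt += "\n"
        return prompt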

@kartikayk
Contributor

@RdoubleA assigning to you to take a deeper look as part of the data utilities RFC.

@ebsmothers
Contributor

This should be solved by #624, since we now split the tokenization of the prompt and the response into separate tokenizer.encode calls.
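
For anyone landing here later, a rough sketch of that approach (assuming the same add_bos/add_eos flags as in the snippets above; not the exact implementation in #624):

    # Sketch: encode prompt and response separately and concatenate, so the mask
    # boundary is exact by construction and no token can merge across it.
    prompt_tokens = tokenizer.encode(text=prompt, add_bos=True, add_eos=False)
    response_tokens = tokenizer.encode(text=response, add_bos=False, add_eos=True)

    tokens = prompt_tokens + response_tokens
    labels = list(tokens)
    if not train_on_input:
        labels[: len(prompt_tokens)] = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens)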
