Instruct and chat datasets docs (#1571)
Showing 5 changed files with 591 additions and 1 deletion.
.. _chat_dataset_usage_label:

=============
Chat Datasets
=============

Chat datasets involve multi-turn conversations (multiple back-and-forths) between a user and an assistant.

.. code-block:: python

    [
        {"role": "user", "content": "What is the answer to the ultimate question of life?"},
        {"role": "assistant", "content": "The answer is 42."},
        {"role": "user", "content": "That's ridiculous"},
        {"role": "assistant", "content": "Oh I know."},
    ]

This is more structured than the free-form text that models are typically pre-trained on,
where they learn to simply predict the next token rather than to respond accurately to the user.

The primary entry point for fine-tuning with chat datasets in torchtune is the :func:`~torchtune.datasets.chat_dataset`
builder. This lets you specify a local or Hugging Face dataset that follows the chat data format
directly from the config and train your LLM on it.

Example chat dataset
--------------------

.. code-block:: python

    # data/my_data.json
    [
        {
            "conversations": [
                {
                    "from": "human",
                    "value": "What is the answer to life?"
                },
                {
                    "from": "gpt",
                    "value": "The answer is 42."
                },
                {
                    "from": "human",
                    "value": "That's ridiculous"
                },
                {
                    "from": "gpt",
                    "value": "Oh I know."
                }
            ]
        }
    ]

.. code-block:: python

    from torchtune.models.mistral import mistral_tokenizer
    from torchtune.datasets import chat_dataset

    m_tokenizer = mistral_tokenizer(
        path="/tmp/Mistral-7B-v0.1/tokenizer.model",
        prompt_template="torchtune.models.mistral.MistralChatTemplate",
        max_seq_len=8192,
    )
    ds = chat_dataset(
        tokenizer=m_tokenizer,
        source="json",
        data_files="data/my_data.json",
        split="train",
        conversation_column="conversations",
        conversation_style="sharegpt",
        # By default, user prompt is ignored in loss. Set to True to include it
        train_on_input=True,
        new_system_prompt=None,
    )
    tokenized_dict = ds[0]
    tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
    print(m_tokenizer.decode(tokens))
    # [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.
    print(labels)
    # [1, 733, 16289, 28793, 1824, 349, 272, 4372, ...]

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.mistral.mistral_tokenizer
      path: /tmp/Mistral-7B-v0.1/tokenizer.model
      prompt_template: torchtune.models.mistral.MistralChatTemplate
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      data_files: data/my_data.json
      split: train
      conversation_column: conversations
      conversation_style: sharegpt
      train_on_input: True
      new_system_prompt: null
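
The effect of ``train_on_input`` on the labels can be pictured with a small standalone sketch. This is not torchtune's implementation: it assumes that labels are a copy of the token IDs in which prompt positions are replaced by an ignore index of ``-100`` (the default ``ignore_index`` of PyTorch's cross-entropy loss), and the ``build_labels`` helper and the token IDs are hypothetical.

.. code-block:: python

    from typing import List

    IGNORE_INDEX = -100  # assumed ignore value; matches PyTorch's CrossEntropyLoss default


    def build_labels(tokens: List[int], is_prompt: List[bool], train_on_input: bool) -> List[int]:
        """Return per-token labels, hiding prompt tokens unless train_on_input is True."""
        if len(tokens) != len(is_prompt):
            raise ValueError("tokens and is_prompt must have the same length")
        if train_on_input:
            return list(tokens)
        return [IGNORE_INDEX if prompt else tok for tok, prompt in zip(tokens, is_prompt)]


    # Hypothetical token IDs: three user-prompt tokens followed by two assistant tokens.
    tokens = [11, 12, 13, 21, 22]
    is_prompt = [True, True, True, False, False]

    masked = build_labels(tokens, is_prompt, train_on_input=False)
    unmasked = build_labels(tokens, is_prompt, train_on_input=True)
    print(masked)    # [-100, -100, -100, 21, 22]
    print(unmasked)  # [11, 12, 13, 21, 22]

With ``train_on_input=True`` the labels equal the tokens, which is consistent with the label IDs printed in the example above.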

Chat dataset format
-------------------

Chat datasets typically have a single column named "conversations" or "messages" that contains a list of messages on a single topic
per sample. The list of messages could include a system prompt, multiple turns between user and assistant, and tool calls/returns.

.. code-block:: text

    | conversations                                                |
    |--------------------------------------------------------------|
    | [{"role": "user", "content": "What day is today?"},          |
    |  {"role": "assistant", "content": "It is Tuesday."}]         |
    | [{"role": "user", "content": "What about tomorrow?"},        |
    |  {"role": "assistant", "content": "Tomorrow is Wednesday."}] |

As an example, you can see the schema of the `SlimOrca dataset <https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup>`_.

Loading chat datasets from Hugging Face
---------------------------------------

You need to pass in the dataset repo name to ``source``, select one of the conversation styles in ``conversation_style``, and specify the ``conversation_column``.
For most HF datasets, you will also need to specify the ``split``.

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="Open-Orca/SlimOrca-Dedup",
        conversation_column="conversations",
        conversation_style="sharegpt",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: Open-Orca/SlimOrca-Dedup
      conversation_column: conversations
      conversation_style: sharegpt
      split: train

Loading local and remote chat datasets
--------------------------------------

To load a local dataset, or a remote dataset over HTTPS, that contains conversational data, you additionally need to specify the ``data_files`` and ``split``
arguments. See Hugging Face's ``load_dataset`` `documentation <https://huggingface.co/docs/datasets/main/en/loading#local-and-remote-files>`_
for more details on loading local or remote files.

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        conversation_column="conversations",
        conversation_style="sharegpt",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      conversation_column: conversations
      conversation_style: sharegpt
      data_files: data/my_data.json
      split: train
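
A file such as ``data/my_data.json`` can be produced with Python's standard ``json`` module. The sketch below is illustrative only: it writes a single ShareGPT-style sample to a temporary directory (the directory and the ``samples`` variable are placeholders) and reads it back to confirm that the top level is a list of samples, each with one conversation column.

.. code-block:: python

    import json
    import tempfile
    from pathlib import Path

    # One sample with a single conversation column, mirroring the earlier example file.
    samples = [
        {
            "conversations": [
                {"from": "human", "value": "What is the answer to life?"},
                {"from": "gpt", "value": "The answer is 42."},
            ]
        }
    ]

    # Write to a throwaway directory; in practice this would be your own data folder.
    data_dir = Path(tempfile.mkdtemp()) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    data_file = data_dir / "my_data.json"
    data_file.write_text(json.dumps(samples, indent=2), encoding="utf-8")

    # Read the file back and inspect the column names of the first sample.
    loaded = json.loads(data_file.read_text(encoding="utf-8"))
    print(len(loaded), list(loaded[0].keys()))

The resulting path is what you would pass as ``data_files``, with ``source`` set to ``json``.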

Specifying conversation style
-----------------------------

The structure of the conversation in the raw dataset can vary widely, with different role names and different field names
for the message content. A few standardized formats are common across many datasets.
We have built-in converters that turn these standardized formats into a list of torchtune :class:`~torchtune.data.Message`
objects following this format:

.. code-block:: python

    [
        {
            "role": "system" | "user" | "assistant" | "ipython",
            "content": <message>,
        },
        ...
    ]

``"sharegpt"``
^^^^^^^^^^^^^^

The associated message transform is :class:`~torchtune.data.ShareGPTToMessages`. The expected format is:

.. code-block:: python

    {
        "conversations": [
            {
                "from": "system" | "human" | "gpt",
                "value": <message>,
            },
            ...
        ]
    }

You can specify ``conversation_style=sharegpt`` in code or config:

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        conversation_column="conversations",
        conversation_style="sharegpt",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      conversation_column: conversations
      conversation_style: sharegpt
      data_files: data/my_data.json
      split: train
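
Conceptually, the ShareGPT conversion renames the fields and maps the role names. The following is a simplified sketch of that idea using plain dictionaries rather than :class:`~torchtune.data.Message` objects. The ``ROLE_MAP`` constant and the ``sharegpt_to_messages`` function are illustrative names, not torchtune API, and the built-in transform may handle further details that are omitted here.

.. code-block:: python

    from typing import Any, Dict, List

    # Assumed mapping from ShareGPT role names to the role names listed above.
    ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


    def sharegpt_to_messages(sample: Dict[str, Any], column: str = "conversations") -> List[Dict[str, str]]:
        """Convert one ShareGPT-style sample into a list of role/content dicts."""
        messages = []
        for turn in sample[column]:
            role = ROLE_MAP.get(turn["from"])
            if role is None:
                raise ValueError(f"Unknown ShareGPT role: {turn['from']!r}")
            messages.append({"role": role, "content": turn["value"]})
        return messages


    sample = {
        "conversations": [
            {"from": "human", "value": "What is the answer to life?"},
            {"from": "gpt", "value": "The answer is 42."},
        ]
    }
    messages = sharegpt_to_messages(sample)
    print(messages)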

``"json"``
^^^^^^^^^^

The associated message transform is :class:`~torchtune.data.JSONToMessages`. The expected format is:

.. code-block:: python

    {
        "messages": [
            {
                "role": "system" | "user" | "assistant",
                "content": <message>,
            },
            ...
        ]
    }

You can specify ``conversation_style=json`` in code or config:

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        # Column name matches the "messages" key in the expected format above
        conversation_column="messages",
        conversation_style="json",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      # Column name matches the "messages" key in the expected format above
      conversation_column: messages
      conversation_style: json
      data_files: data/my_data.json
      split: train

If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.
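
As a rough illustration of what a custom message transform might look like, the sketch below converts a hypothetical dataset whose turns use ``speaker`` and ``text`` fields. It is written as a plain callable that returns role/content dictionaries under a ``"messages"`` key. That interface is an assumption modeled on the message format shown earlier, and a real transform would build :class:`~torchtune.data.Message` objects instead. All names here (``SpeakerTextToMessages``, ``turns``, ``speaker``, ``text``, ``customer``, ``agent``) are made up for the example.

.. code-block:: python

    from typing import Any, Dict, List, Mapping

    # Hypothetical role names used by the raw dataset, mapped to the standard roles.
    ROLE_MAP = {"customer": "user", "agent": "assistant"}


    class SpeakerTextToMessages:
        """Convert samples shaped like {"turns": [{"speaker": ..., "text": ...}]}."""

        def __init__(self, column: str = "turns") -> None:
            self.column = column

        def __call__(self, sample: Mapping[str, Any]) -> Dict[str, List[Dict[str, str]]]:
            messages = []
            for turn in sample[self.column]:
                if turn["speaker"] not in ROLE_MAP:
                    raise ValueError(f"Unknown speaker: {turn['speaker']!r}")
                messages.append({"role": ROLE_MAP[turn["speaker"]], "content": turn["text"]})
            return {"messages": messages}


    transform = SpeakerTextToMessages()
    result = transform(
        {
            "turns": [
                {"speaker": "customer", "text": "What day is today?"},
                {"speaker": "agent", "text": "It is Tuesday."},
            ]
        }
    )
    print(result["messages"])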


Renaming columns
----------------

To specify the column that contains your conversation data, use ``conversation_column``.

.. code-block:: python

    # data/my_data.json
    [
        {
            "dialogue": [
                {
                    "from": "human",
                    "value": "What is the answer to life?"
                },
                {
                    "from": "gpt",
                    "value": "The answer is 42."
                },
                {
                    "from": "human",
                    "value": "That's ridiculous"
                },
                {
                    "from": "gpt",
                    "value": "Oh I know."
                }
            ]
        }
    ]

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        conversation_column="dialogue",
        conversation_style="sharegpt",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      conversation_column: dialogue
      conversation_style: sharegpt
      data_files: data/my_data.json
      split: train

Chat templates
--------------

Chat templates are defined the same way as instruct templates in :func:`~torchtune.datasets.instruct_dataset`. See :ref:`instruct_template` for more info.
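
To illustrate the general idea of a chat template, the sketch below wraps each message in role-specific tags before tokenization. The ``INST_TEMPLATE`` mapping and the ``render`` function are hypothetical and only approximate the ``[INST]`` formatting visible in the decoded Mistral example earlier. They are not the torchtune template API.

.. code-block:: python

    from typing import Dict, List, Tuple

    # Hypothetical template: role -> (text placed before content, text placed after content).
    INST_TEMPLATE: Dict[str, Tuple[str, str]] = {
        "user": ("[INST] ", " [/INST] "),
        "assistant": ("", " "),
    }


    def render(messages: List[Dict[str, str]], template: Dict[str, Tuple[str, str]]) -> str:
        """Wrap each message's content in the tags configured for its role."""
        parts = []
        for message in messages:
            # Roles without an entry in the template are passed through unchanged.
            before, after = template.get(message["role"], ("", ""))
            parts.append(before + message["content"] + after)
        return "".join(parts).strip()


    messages = [
        {"role": "user", "content": "What is the answer to life?"},
        {"role": "assistant", "content": "The answer is 42."},
        {"role": "user", "content": "That's ridiculous"},
        {"role": "assistant", "content": "Oh I know."},
    ]
    rendered = render(messages, INST_TEMPLATE)
    print(rendered)
    # [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.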


Built-in chat datasets
----------------------

- :func:`~torchtune.datasets.slimorca_dataset`