Instruct and chat datasets docs (#1571)

pytorch · Sep 20, 2024 · 9a863c8 · 9a863c8
1 parent e3718e8
commit 9a863c8
Showing 5 changed files with 591 additions and 1 deletion.
diff --git a/docs/source/basics/chat_datasets.rst b/docs/source/basics/chat_datasets.rst
@@ -0,0 +1,357 @@
+.. _chat_dataset_usage_label:
+
+=============
+Chat Datasets
+=============
+
+Chat datasets involve multi-turn conversations (multiple back-and-forths) between user and assistant.
+
+.. code-block:: python
+
+    [
+        {"role": "user", "content": "What is the answer to the ultimate question of life?"},
+        {"role": "assistant", "content": "The answer is 42."},
+        {"role": "user", "content": "That's ridiculous"},
+        {"role": "assistant", "content": "Oh I know."},
+    ]
+
+This is more structured than the freeform text that models are typically pre-trained on,
+where they learn simply to predict the next token rather than to respond accurately to the user.
+
+The primary entry point for fine-tuning with chat datasets in torchtune is the :func:`~torchtune.datasets.chat_dataset`
+builder. This lets you specify a local or Hugging Face dataset that follows the chat data format
+directly from the config and train your LLM on it.
+
+Example chat dataset
+--------------------
+
+.. code-block:: python
+
+    # data/my_data.json
+    [
+        {
+            "conversations": [
+                {
+                    "from": "human",
+                    "value": "What is the answer to life?"
+                },
+                {
+                    "from": "gpt",
+                    "value": "The answer is 42."
+                },
+                {
+                    "from": "human",
+                    "value": "That's ridiculous"
+                },
+                {
+                    "from": "gpt",
+                    "value": "Oh I know."
+                }
+            ]
+        }
+    ]
+
+.. code-block:: python
+
+    from torchtune.models.mistral import mistral_tokenizer
+    from torchtune.datasets import chat_dataset
+
+    m_tokenizer = mistral_tokenizer(
+        path="/tmp/Mistral-7B-v0.1/tokenizer.model",
+        prompt_template="torchtune.models.mistral.MistralChatTemplate",
+        max_seq_len=8192,
+    )
+    ds = chat_dataset(
+        tokenizer=m_tokenizer,
+        source="json",
+        data_files="data/my_data.json",
+        split="train",
+        conversation_column="conversations",
+        conversation_style="sharegpt",
+        # By default, user prompt is ignored in loss. Set to True to include it
+        train_on_input=True,
+        new_system_prompt=None,
+    )
+    tokenized_dict = ds[0]
+    tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
+    print(m_tokenizer.decode(tokens))
+    # [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.
+    print(labels)
+    # [1, 733, 16289, 28793, 1824, 349, 272, 4372, ...]
+
+.. code-block:: yaml
+
+    # In config
+    tokenizer:
+      _component_: torchtune.models.mistral.mistral_tokenizer
+      path: /tmp/Mistral-7B-v0.1/tokenizer.model
+      prompt_template: torchtune.models.mistral.MistralChatTemplate
+      max_seq_len: 8192
+
+    dataset:
+      _component_: torchtune.datasets.chat_dataset
+      source: json
+      data_files: data/my_data.json
+      split: train
+      conversation_column: conversations
+      conversation_style: sharegpt
+      train_on_input: True
+      new_system_prompt: null
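The ``train_on_input`` flag in the example above decides whether prompt tokens contribute to the loss. The following is a minimal sketch of that idea in plain Python, not torchtune's actual implementation. It assumes each token carries a boolean "is prompt" mask and that masked positions are replaced with ``-100``, the default ``ignore_index`` of PyTorch's cross-entropy loss. The helper name ``build_labels`` and the token IDs are made up for illustration.

```python
from typing import List

# Assumption: -100 is used as the ignore index (PyTorch's cross-entropy default).
IGNORE_INDEX = -100


def build_labels(tokens: List[int], is_prompt: List[bool], train_on_input: bool) -> List[int]:
    """Return training labels for a tokenized conversation (hypothetical helper)."""
    if len(tokens) != len(is_prompt):
        raise ValueError("tokens and is_prompt must have the same length")
    if train_on_input:
        # Every token contributes to the loss, so the labels mirror the tokens.
        return list(tokens)
    # Prompt tokens are replaced by the ignore index; assistant tokens are kept.
    return [IGNORE_INDEX if masked else tok for tok, masked in zip(tokens, is_prompt)]


# Made-up token IDs: three prompt tokens followed by two assistant tokens.
tokens = [11, 12, 13, 21, 22]
is_prompt = [True, True, True, False, False]

print(build_labels(tokens, is_prompt, train_on_input=True))
print(build_labels(tokens, is_prompt, train_on_input=False))
```

With ``train_on_input=True`` the labels are identical to the tokens, which matches the example above where ``labels`` prints the same IDs as the encoded conversation.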
+
+Chat dataset format
+-------------------
+
+Chat datasets typically have a single column named "conversations" or "messages" that contains a list of messages on a single topic
+per sample. The list of messages could include a system prompt, multiple turns between user and assistant, and tool calls/returns.
+
+.. code-block:: text
+
+    | conversations                                                |
+    |--------------------------------------------------------------|
+    | [{"role": "user", "content": "What day is today?"},          |
+    |  {"role": "assistant", "content": "It is Tuesday."}]         |
+    | [{"role": "user", "content": "What about tomorrow?"},        |
+    |  {"role": "assistant", "content": "Tomorrow is Wednesday."}] |
+
+As an example, you can see the schema of the `SlimOrca dataset <https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup>`_.
+
+Loading chat datasets from Hugging Face
+---------------------------------------
+
+You need to pass in the dataset repo name to ``source``, select one of the conversation styles in ``conversation_style``, and specify the ``conversation_column``.
+For most HF datasets, you will also need to specify the ``split``.
+
+.. code-block:: python
+
+    from torchtune.models.gemma import gemma_tokenizer
+    from torchtune.datasets import chat_dataset
+
+    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
+    ds = chat_dataset(
+        tokenizer=g_tokenizer,
+        source="Open-Orca/SlimOrca-Dedup",
+        conversation_column="conversations",
+        conversation_style="sharegpt",
+        split="train",
+    )
+
+.. code-block:: yaml
+
+    # Tokenizer is passed into the dataset in the recipe
+    dataset:
+      _component_: torchtune.datasets.chat_dataset
+      source: Open-Orca/SlimOrca-Dedup
+      conversation_column: conversations
+      conversation_style: sharegpt
+      split: train
+
+
+Loading local and remote chat datasets
+--------------------------------------
+
+To load a local dataset, or a remote dataset served over HTTPS, that contains conversational data, you additionally need to specify
+the ``data_files`` and ``split`` arguments. See Hugging Face's ``load_dataset`` `documentation <https://huggingface.co/docs/datasets/main/en/loading#local-and-remote-files>`_
+for more details on loading local or remote files.
+
+.. code-block:: python
+
+    from torchtune.models.gemma import gemma_tokenizer
+    from torchtune.datasets import chat_dataset
+
+    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
+    ds = chat_dataset(
+        tokenizer=g_tokenizer,
+        source="json",
+        conversation_column="conversations",
+        conversation_style="sharegpt",
+        data_files="data/my_data.json",
+        split="train",
+    )
+
+.. code-block:: yaml
+
+    # Tokenizer is passed into the dataset in the recipe
+    dataset:
+      _component_: torchtune.datasets.chat_dataset
+      source: json
+      conversation_column: conversations
+      conversation_style: sharegpt
+      data_files: data/my_data.json
+      split: train
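Before pointing ``data_files`` at a local file, it can help to confirm that the file has the shape the chosen conversation style expects. The sketch below uses only the Python standard library and assumes the ShareGPT layout of ``data/my_data.json`` shown earlier. The function name ``check_sharegpt_file`` is hypothetical and not part of torchtune or Hugging Face ``datasets``.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def check_sharegpt_file(path: Path, column: str = "conversations") -> List[str]:
    """Return a list of problems found in a ShareGPT-style JSON file (hypothetical helper)."""
    problems = []
    records = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of samples"]
    for i, record in enumerate(records):
        turns = record.get(column) if isinstance(record, dict) else None
        if not isinstance(turns, list):
            problems.append(f"sample {i}: missing list column {column!r}")
            continue
        for j, turn in enumerate(turns):
            if not isinstance(turn, dict) or not {"from", "value"} <= set(turn):
                problems.append(f"sample {i}, turn {j}: expected 'from' and 'value' keys")
    return problems


# Write a tiny example file to a temporary directory and check it.
example = [{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]}]
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "my_data.json"
    data_path.write_text(json.dumps(example), encoding="utf-8")
    print(check_sharegpt_file(data_path))                     # matching column name
    print(check_sharegpt_file(data_path, column="dialogue"))  # mismatched column name
```

A check like this catches a mismatched ``conversation_column`` before any tokenization work starts.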
+
+Specifying conversation style
+-----------------------------
+
+The structure of the conversation in the raw dataset can vary widely, with different role names and different field names
+for the message content. A few standardized formats are common across many datasets.
+We have built-in converters that turn these standardized formats into a list of torchtune :class:`~torchtune.data.Message`
+objects that follows this format:
+
+.. code-block:: python
+
+    [
+        {
+            "role": "system" | "user" | "assistant" | "ipython",
+            "content": <message>,
+        },
+        ...
+    ]
+
+``"sharegpt"``
+^^^^^^^^^^^^^^
+The associated message transform is :class:`~torchtune.data.ShareGPTToMessages`. The expected format is:
+
+.. code-block:: python
+
+    {
+        "conversations": [
+            {
+                "from": "system" | "human" | "gpt",
+                "value": <message>,
+            },
+            ...
+        ]
+    }
+
+You can specify ``conversation_style=sharegpt`` in code or config:
+
+.. code-block:: python
+
+    from torchtune.models.gemma import gemma_tokenizer
+    from torchtune.datasets import chat_dataset
+
+    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
+    ds = chat_dataset(
+        tokenizer=g_tokenizer,
+        source="json",
+        conversation_column="conversations",
+        conversation_style="sharegpt",
+        data_files="data/my_data.json",
+        split="train",
+    )
+
+.. code-block:: yaml
+
+    # Tokenizer is passed into the dataset in the recipe
+    dataset:
+      _component_: torchtune.datasets.chat_dataset
+      source: json
+      conversation_column: conversations
+      conversation_style: sharegpt
+      data_files: data/my_data.json
+      split: train
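Conceptually, the ``sharegpt`` style renames ``from``/``value`` to ``role``/``content`` and maps the ShareGPT role names onto the roles used in the message format above. The following plain-Python sketch illustrates that mapping. It is not the :class:`~torchtune.data.ShareGPTToMessages` implementation, it returns plain dicts rather than ``Message`` objects, and the helper name ``sharegpt_to_role_content`` is hypothetical. The role mapping is inferred from the examples in this document (``human`` to ``user``, ``gpt`` to ``assistant``).

```python
from typing import Dict, List

# Role mapping inferred from the examples in this document.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_role_content(sample: Dict, column: str = "conversations") -> List[Dict[str, str]]:
    """Convert one ShareGPT-style sample to role/content dicts (hypothetical helper)."""
    messages = []
    for turn in sample[column]:
        role = turn["from"]
        if role not in ROLE_MAP:
            raise ValueError(f"Unexpected ShareGPT role: {role!r}")
        messages.append({"role": ROLE_MAP[role], "content": turn["value"]})
    return messages


sample = {
    "conversations": [
        {"from": "human", "value": "What is the answer to life?"},
        {"from": "gpt", "value": "The answer is 42."},
    ]
}
for message in sharegpt_to_role_content(sample):
    print(message)
```

The ``column`` argument plays the same part as ``conversation_column``: it names the field that holds the list of turns.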
+
+``"json"``
+^^^^^^^^^^
+The associated message transform is :class:`~torchtune.data.JSONToMessages`. The expected format is:
+
+.. code-block:: python
+
+    {
+        "messages": [
+            {
+                "role": "system" | "user" | "assistant",
+                "content": <message>,
+            },
+            ...
+        ]
+    }
+
+You can specify ``conversation_style=json`` in code or config:
+
+.. code-block:: python
+
+    from torchtune.models.gemma import gemma_tokenizer
+    from torchtune.datasets import chat_dataset
+
+    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
+    ds = chat_dataset(
+        tokenizer=g_tokenizer,
+        source="json",
+        # Must match the column name in the data; the expected format above uses "messages"
+        conversation_column="messages",
+        conversation_style="json",
+        data_files="data/my_data.json",
+        split="train",
+    )
+
+.. code-block:: yaml
+
+    # Tokenizer is passed into the dataset in the recipe
+    dataset:
+      _component_: torchtune.datasets.chat_dataset
+      source: json
+      # Must match the column name in the data; the expected format above uses "messages"
+      conversation_column: messages
+      conversation_style: json
+      data_files: data/my_data.json
+      split: train
+
+If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.
+
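A custom message transform is, in essence, a callable that maps one raw sample to the role/content structure described above. The sketch below is plain Python and deliberately avoids torchtune classes. The raw format (a ``turns`` column with ``speaker`` and ``text`` keys), the class name ``SpeakerTextToMessages``, and the role mapping are all hypothetical. A real transform would need to produce torchtune :class:`~torchtune.data.Message` objects in the form the dataset class expects, so check the API reference before adapting it.

```python
from typing import Any, Dict, List, Mapping


class SpeakerTextToMessages:
    """Hypothetical transform for samples shaped like
    {"turns": [{"speaker": "customer", "text": "..."}, ...]}."""

    def __init__(self, column: str = "turns") -> None:
        self.column = column
        # Assumed mapping from this made-up dataset's speakers to chat roles.
        self.role_map = {"customer": "user", "agent": "assistant"}

    def __call__(self, sample: Mapping[str, Any]) -> Dict[str, List[Dict[str, str]]]:
        # Rename the fields and translate each speaker into a chat role.
        messages = [
            {"role": self.role_map[turn["speaker"]], "content": turn["text"]}
            for turn in sample[self.column]
        ]
        return {"messages": messages}


transform = SpeakerTextToMessages()
raw = {
    "turns": [
        {"speaker": "customer", "text": "What day is today?"},
        {"speaker": "agent", "text": "It is Tuesday."},
    ]
}
print(transform(raw))
```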
+
+Renaming columns
+----------------
+
+To specify the column that contains your conversation data, use ``conversation_column``.
+
+.. code-block:: python
+
+    # data/my_data.json
+    [
+        {
+            "dialogue": [
+                {
+                    "from": "human",
+                    "value": "What is the answer to life?"
+                },
+                {
+                    "from": "gpt",
+                    "value": "The answer is 42."
+                },
+                {
+                    "from": "human",
+                    "value": "That's ridiculous"
+                },
+                {
+                    "from": "gpt",
+                    "value": "Oh I know."
+                }
+            ]
+        }
+    ]
+
+.. code-block:: python
+
+    from torchtune.models.gemma import gemma_tokenizer
+    from torchtune.datasets import chat_dataset
+
+    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
+    ds = chat_dataset(
+        tokenizer=g_tokenizer,
+        source="json",
+        conversation_column="dialogue",
+        conversation_style="sharegpt",
+        data_files="data/my_data.json",
+        split="train",
+    )
+
+.. code-block:: yaml
+
+    # Tokenizer is passed into the dataset in the recipe
+    dataset:
+      _component_: torchtune.datasets.chat_dataset
+      source: json
+      conversation_column: dialogue
+      conversation_style: sharegpt
+      data_files: data/my_data.json
+      split: train
+
+
+Chat templates
+--------------
+
+Chat templates are defined the same way as instruct templates in :func:`~torchtune.datasets.instruct_dataset`. See :ref:`instruct_template` for more info.
+
+
+Built-in chat datasets
+----------------------
+- :func:`~torchtune.datasets.slimorca_dataset`