Alpaca prompt template #515
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/515
Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit a8be21b with merge base e164402. This comment was automatically generated by Dr. CI and updates every 15 minutes.
✅ Deploy Preview for torchtune-preview ready!
torchtune/data/templates.py
Outdated
@@ -0,0 +1,87 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
This should have a leading underscore and an `__init__.py` file that exposes AlpacaPromptTemplate
oops, forgot to add that
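A sketch of the layout being suggested (the private module name `_templates.py` and the exported symbol are assumptions based on the comment, not confirmed by the PR):

```python
# torchtune/data/__init__.py
# Re-export the template from the underscored private module so users
# import it from torchtune.data rather than from the internal file.
from torchtune.data._templates import AlpacaTemplate

__all__ = ["AlpacaTemplate"]
```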
class TestAlpacaInstructTemplate:
    def test_format(self):
I actually configured the original test to include real data from the alpaca test. Can you do the same? It helps to verify that the test passes on real data.
In fact why not just refactor data from that test directly?
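A hedged sketch of what such a test could look like, with a minimal stand-in for the template under test (the class interface, template text, and sample content are illustrative assumptions, not the real Alpaca data referenced above):

```python
# Minimal stand-in for the template under test (assumed interface).
class AlpacaInstructTemplate:
    template = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    )

    def format(self, sample):
        return self.template.format(
            instruction=sample["instruction"], input=sample["input"]
        )


class TestAlpacaInstructTemplate:
    # Representative sample in the style of the Alpaca dataset (illustrative).
    sample = {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
    }

    def test_format(self):
        prompt = AlpacaInstructTemplate().format(self.sample)
        assert prompt.startswith("Below is an instruction")
        assert "### Instruction:\nGive three tips for staying healthy." in prompt
```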
Args:
    sample (Mapping): a single data sample with instruction
    column_map (Optional[Dict[str, str]]): a mapping from the expected
`column_map` seems like a nice generalization, but I don't quite understand the use case. Can you expand on this?
See all the prompting strategies inheriting from InstructionPromptTokenizingStrategy here in Axolotl: https://github.com/OpenAccess-AI-Collective/axolotl/blob/2ea70ebbd8f1d8d46e692afd05773dcf06626601/src/axolotl/prompt_tokenizers.py#L148
`column_map` serves the purpose of making sure the right columns are used for instruction and input. For the samsum and grammar datasets we will need this, because the datasets on the hub will not use "instruction" or "input" as their column names, since they are specific types of instruct tasks.
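As a hedged sketch of that use case (class name, method signature, and template text are assumed for illustration, and the sample is made up), a `column_map` lookup might work like this:

```python
from typing import Dict, Mapping, Optional


class AlpacaTemplate:
    """Sketch of a template whose format() can remap column names."""

    template = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    )

    def format(
        self,
        sample: Mapping[str, str],
        column_map: Optional[Dict[str, str]] = None,
    ) -> str:
        # Fall back to the default "instruction" key when no mapping is given.
        if column_map is not None:
            key = column_map.get("instruction", "instruction")
        else:
            key = "instruction"
        return self.template.format(instruction=sample[key])


# A grammar-correction dataset might store its prompt under "sentence":
sample = {"sentence": "She go to school every day."}
prompt = AlpacaTemplate().format(sample, column_map={"instruction": "sentence"})
```

Without the `column_map` argument, the same call would raise a `KeyError` on this sample, since it has no "instruction" column.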
Shall we move the dataset folder under the data folder?
Saving this for a later PR, as that will require considerable refactoring.
Thanks for the quick change!
Context
We need to have a standardized set of prompt templates for our flagship datasets and to enable users to configure their own custom dataset, as discussed in the RFC (#493).
First, we create the `PromptTemplate` interface which all templates will be based on. `AlpacaTemplate` is added to demonstrate the interface, and `AlpacaDataset` is refactored to use it.

Test plan

pytest tests/torchtune/datasets/test_alpaca_dataset.py
pytest tests/torchtune/data/test_templates.py