Alpaca prompt template #515

Merged
RdoubleA merged 5 commits into main from rafiayub/template_interface on Mar 18, 2024

RdoubleA
Contributor

Context

We need a standardized set of prompt templates for our flagship datasets, and users need to be able to configure their own custom datasets, as discussed in the RFC (#493).

First, we create the PromptTemplate interface that all templates will be based on. AlpacaTemplate is added to demonstrate the interface, and AlpacaDataset is refactored to use it.
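
For readers skimming the thread, here is a minimal sketch of what such an interface could look like. It is not the code merged in this PR: the class names PromptTemplateSketch and AlpacaTemplateSketch are hypothetical, and the format signature is assumed from the docstring fragment quoted in the review below. The prompt wording follows the publicly known Stanford Alpaca format.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Mapping, Optional


class PromptTemplateSketch(ABC):
    """Hypothetical stand-in for the PromptTemplate interface (not the merged code)."""

    @abstractmethod
    def format(
        self, sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
    ) -> str:
        """Turn one raw data sample into a prompt string."""


class AlpacaTemplateSketch(PromptTemplateSketch):
    """Illustrative Alpaca-style template built on the sketch interface."""

    PROMPT_INPUT = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    )
    PROMPT_NO_INPUT = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    )

    def format(
        self, sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
    ) -> str:
        # Assumed behavior: column_map maps expected names to actual column names.
        column_map = column_map or {}
        instruction_key = column_map.get("instruction", "instruction")
        input_key = column_map.get("input", "input")
        # Use the "with input" variant only when the input column is non-empty.
        if sample.get(input_key):
            return self.PROMPT_INPUT.format(
                instruction=sample[instruction_key], input=sample[input_key]
            )
        return self.PROMPT_NO_INPUT.format(instruction=sample[instruction_key])
```

A dataset class would then call format on each sample before tokenizing, which keeps the prompt wording separate from the data loading logic.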

Test plan

pytest tests/torchtune/datasets/test_alpaca_dataset.py
pytest tests/torchtune/data/test_templates.py

pytorch-bot bot commented Mar 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/515

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a8be21b with merge base e164402:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Mar 18, 2024

netlify bot commented Mar 18, 2024

Deploy Preview for torchtune-preview ready!

Name Link
🔨 Latest commit a8be21b
🔍 Latest deploy log https://app.netlify.com/sites/torchtune-preview/deploys/65f8b882a010500008fed3cf
😎 Deploy Preview https://deploy-preview-515--torchtune-preview.netlify.app

@@ -0,0 +1,87 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor

This file should have a leading underscore, and there should be an __init__.py file that exposes AlpacaPromptTemplate

Contributor Author

oops, forgot to add that



class TestAlpacaInstructTemplate:
    def test_format(self):
Contributor

I actually configured the original test to include real data from the alpaca test. Can you do the same? It helps to verify that the test passes on real data.

Contributor

In fact why not just refactor data from that test directly?


Args:
    sample (Mapping): a single data sample with instruction
    column_map (Optional[Dict[str, str]]): a mapping from the expected
Contributor

column_map seems like a nice generalization, but I don't quite understand the use case. Can you expand on this?

Contributor Author

See all the prompting strategies inheriting from InstructionPromptTokenizingStrategy here in Axolotl: https://github.com/OpenAccess-AI-Collective/axolotl/blob/2ea70ebbd8f1d8d46e692afd05773dcf06626601/src/axolotl/prompt_tokenizers.py#L148

column_map makes sure the right columns are used for instruction and input. For the samsum and grammar datasets, we will need it because the datasets on the hub do not use "instruction" or "input" as their column names, since they are specific types of instruct tasks.
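
To make the use case concrete, here is a rough sketch of how a column map could redirect a template's expected fields to a dataset's actual columns. The helper resolve_columns, the shortened template string, and the column names "task" and "dialogue" are all made up for illustration; they are not torchtune APIs and not the real schema of any hub dataset.

```python
from typing import Any, Dict, Mapping, Optional, Sequence


def resolve_columns(
    sample: Mapping[str, Any],
    expected: Sequence[str],
    column_map: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: look up each expected field under its mapped column name."""
    column_map = column_map or {}
    # Fall back to the expected name itself when no mapping is given for it.
    return {name: sample[column_map.get(name, name)] for name in expected}


# Shortened instruct-style template, for illustration only.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"

# A sample whose columns are NOT named "instruction" / "input" (names are invented).
sample = {"task": "Summarize the dialogue.", "dialogue": "A: hi\nB: hello"}

fields = resolve_columns(
    sample,
    expected=["instruction", "input"],
    column_map={"instruction": "task", "input": "dialogue"},
)
prompt = TEMPLATE.format(**fields)
```

With no column_map, the same helper reads "instruction" and "input" directly, so Alpaca-style datasets need no extra configuration.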

Contributor

SLR722 commented Mar 18, 2024

Shall we move dataset folder under data folder?

Contributor Author

Shall we move dataset folder under data folder?

Saving this for a later PR, as that will require considerable refactoring

Contributor

kartikayk left a comment

Thanks for the quick change!

RdoubleA merged commit e145b16 into main on Mar 18, 2024
21 checks passed
RdoubleA deleted the rafiayub/template_interface branch on March 18, 2024 22:05
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.