
[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

Merged
60 commits merged into pytorch:main on Sep 4, 2024

Conversation

@RdoubleA (Contributor) commented Jul 10, 2024

Changelog

  • Add two example multimodal dataset builders, each with their own dataset transform: The Cauldron and LLaVA-Instruct-150K. Both require slightly different preprocessing and serve as good reference examples.
  • Add a utility to quickly map raw text containing image tags into the format expected by the Message content field (a rough sketch of the idea follows the test plan below).
  • Upgrade ShareGPTToMessages to support an image column.

Test plan

  • Unit tests for The Cauldron and LLaVA-Instruct datasets
  • Unit test for split_text_by_image_tag
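
For context, here is a rough, self-contained sketch of what the image-tag splitting utility does. The actual split_text_by_image_tag added in this PR may have a different signature and return type; the dict-based content items below are an assumption, not the library API.

```python
import re
from typing import Any, Dict, List


def split_text_by_image_tag(text: str, image_tag: str = "<image>") -> List[Dict[str, Any]]:
    """Split raw text containing image tags into Message-style content items."""
    content: List[Dict[str, Any]] = []
    # re.split with a capturing group keeps the delimiters in the output.
    for chunk in re.split(f"({re.escape(image_tag)})", text):
        if chunk == image_tag:
            content.append({"type": "image"})
        elif chunk:  # drop empty strings at the boundaries
            content.append({"type": "text", "content": chunk})
    return content


# Example:
# split_text_by_image_tag("What is in this image? <image>")
# -> [{"type": "text", "content": "What is in this image? "}, {"type": "image"}]
```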

pytorch-bot bot commented Jul 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1158

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 703b986 with merge base c6693d4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 10, 2024
@RdoubleA RdoubleA marked this pull request as draft July 10, 2024 00:26
@RdoubleA RdoubleA changed the title [WIP] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024
@RdoubleA RdoubleA marked this pull request as ready for review August 22, 2024 01:13
@RdoubleA RdoubleA changed the title Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) [7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024
@ebsmothers (Contributor) left a comment

Need to fix CI

torchtune/data/_utils.py (comment resolved, outdated)
torchtune/data/_utils.py (comment resolved)
torchtune/data/_messages.py (comment resolved, outdated)
torchtune/datasets/multimodal/_llava_instruct.py (comment resolved, outdated)
torchtune/datasets/multimodal/_llava_instruct.py (comment resolved, outdated)
torchtune/datasets/multimodal/_the_cauldron.py (comment resolved, outdated)
torchtune/datasets/multimodal/_the_cauldron.py (comment resolved, outdated)
codecov-commenter commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 88.41463% with 19 lines in your changes missing coverage. Please review.

Project coverage is 72.46%. Comparing base (71be8ad) to head (8079294).
Report is 5 commits behind head on main.

Files with missing lines                              Patch %   Lines
torchtune/datasets/multimodal/_llava_instruct.py      72.97%    10 Missing ⚠️
torchtune/datasets/multimodal/_the_cauldron.py        71.87%    9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1158      +/-   ##
==========================================
+ Coverage   72.25%   72.46%   +0.20%     
==========================================
  Files         274      279       +5     
  Lines       13278    13475     +197     
==========================================
+ Hits         9594     9764     +170     
- Misses       3684     3711      +27     


@ebsmothers (Contributor) left a comment

A few more comments, but no major concerns from me. Stamping to unblock.

Comment on lines +71 to +74
-2: 1,
12: 1,
10: 1,
-1: 1,
Contributor

Minor thing, but why is our dummy tokenizer returning negatives? I thought it was just the length of each word (or am I misunderstanding?).

@RdoubleA (Contributor, Author)

Those are the image token and the EOS id.

torchtune/datasets/multimodal/_llava_instruct.py (comment resolved, outdated)
Comment on lines 186 to 189
if model_transform.max_seq_len is None:
    raise ValueError(
        "PackedDataset requires a max_seq_len to be set on the tokenizer."
    )
Contributor

This part is a bit confusing to me. The model transform has a max_seq_len, but that's really just associated with the tokenizer, right? So do we pass that up from the tokenizer to the model transform (since it's composed of both the tokenizer and the image transform)?

@RdoubleA (Contributor, Author)

Excellent question. @pbontrager and I were considering enforcing some contract at the model transform level, but maybe for now we just pass that up from the tokenizer (a rough sketch of that option is below).
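
For illustration, a minimal sketch of the "pass it up from the tokenizer" option. The class and attribute names are hypothetical, not the actual torchtune API.

```python
class ComposedModelTransform:
    """Hypothetical transform composed of a tokenizer and an image transform."""

    def __init__(self, tokenizer, image_transform):
        self.tokenizer = tokenizer
        self.image_transform = image_transform
        # Surface the tokenizer's max_seq_len on the transform itself so dataset
        # utilities (e.g. the PackedDataset check above) can read
        # model_transform.max_seq_len directly.
        self.max_seq_len = tokenizer.max_seq_len
```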

Comment on lines +139 to +142
model_transform (Transform): model-specific transform class that takes in a sample dict and applies custom
transforms on the keys. It should consist of at minimum two components: text tokenization (called
on the "messages" field) and image transform (called on the "images" field). The keys returned by
the model transform should be aligned with the expected inputs into the model.
Contributor

Sorry to harp on this, but can we have a more explicit example until the Flamingo transform is available? Even if it's just a mock tokenizer and image transform with clearly defined API contracts in a code snippet, I think that will help a lot.
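
Along those lines, a hedged mock of what such a transform could look like: the tokenize_messages call and the returned keys ("tokens", "mask", "encoder_input") are assumptions about the contract, not the final Flamingo transform.

```python
from typing import Any, Mapping


class MockModelTransform:
    """Mock transform composing text tokenization and an image transform."""

    def __init__(self, tokenizer, image_transform, max_seq_len: int = 2048):
        self.tokenizer = tokenizer            # tokenizes a list of Messages
        self.image_transform = image_transform  # transforms a single image
        self.max_seq_len = max_seq_len

    def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
        # Tokenize the "messages" field produced by the dataset's message transform.
        tokens, mask = self.tokenizer.tokenize_messages(sample["messages"])
        # Apply the image transform to each image in the "images" field.
        images = [self.image_transform(img) for img in sample["images"]]
        # The returned keys should line up with the model's expected inputs.
        return {"tokens": tokens, "mask": mask, "encoder_input": {"images": images}}
```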

@@ -463,3 +463,45 @@ def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
updated_messages.append(Message.from_dict(message))

return {"messages": updated_messages}


def validate_messages(
Contributor

If we're moving this here, should we also delete it from data/_utils.py?

Contributor

Also for multimodal, do we need to do any extra validation? E.g. number of images == number of messages with type == 'image'? (Not saying we should do it here btw)

@RdoubleA (Contributor, Author)

Good point... I'm not sure yet where this should happen. Maybe note this as a follow-up once the whole end-to-end flow is established (one possible check is sketched below).
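
One possible shape for that follow-up check, sketched under the assumption that Message.content is a list of dicts with a "type" field; this is not code from this PR.

```python
def validate_image_count(messages, images) -> None:
    """Check that the image placeholders in the messages match the attached images."""
    # Count image-type content items across all messages in the sample.
    num_image_placeholders = sum(
        1
        for message in messages
        for item in message.content
        if item.get("type") == "image"
    )
    if num_image_placeholders != len(images):
        raise ValueError(
            f"Sample has {len(images)} images but {num_image_placeholders} "
            "image placeholders in its messages."
        )
```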

@RdoubleA RdoubleA merged commit cf327a9 into pytorch:main Sep 4, 2024
20 checks passed
@RdoubleA RdoubleA deleted the mm_dataset branch September 4, 2024 17:53
Labels: CLA Signed
4 participants