
[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158

Merged
60 commits merged into pytorch:main on Sep 4, 2024

Conversation

@RdoubleA (Contributor) commented Jul 10, 2024

Changelog

  • Add two example multimodal dataset builders, each with their own dataset transform: The Cauldron and LLaVA-Instruct-150K. Both require slightly different preprocessing and serve as good reference examples.
  • Add a utility to quickly map raw text containing image tags into the format expected by the Message content field (a rough sketch of the idea follows the test plan below).
  • Upgrade ShareGPTToMessages to support an image column.

Test plan

  • Unit tests for The Cauldron and LLaVA-Instruct datasets
  • Unit test for split_text_by_image_tag
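
For context, here is a rough, self-contained sketch of what the image-tag splitting utility does. The actual split_text_by_image_tag added in this PR may have a different signature and return type; the dict-based content items below are an assumption, not the library API.

```python
import re
from typing import Any, Dict, List


def split_text_by_image_tag(text: str, image_tag: str = "<image>") -> List[Dict[str, Any]]:
    """Split raw text containing image tags into Message-style content items."""
    content: List[Dict[str, Any]] = []
    # re.split with a capturing group keeps the delimiters in the output.
    for chunk in re.split(f"({re.escape(image_tag)})", text):
        if chunk == image_tag:
            content.append({"type": "image"})
        elif chunk:  # drop empty strings at the boundaries
            content.append({"type": "text", "content": chunk})
    return content


# Example:
# split_text_by_image_tag("What is in this image? <image>")
# -> [{"type": "text", "content": "What is in this image? "}, {"type": "image"}]
```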

pytorch-bot bot commented Jul 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1158

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 703b986 with merge base c6693d4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 10, 2024
@RdoubleA RdoubleA marked this pull request as draft July 10, 2024 00:26
@RdoubleA RdoubleA changed the title [WIP] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024
@RdoubleA RdoubleA marked this pull request as ready for review August 22, 2024 01:13
@RdoubleA RdoubleA changed the title Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) [7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) Aug 22, 2024
@ebsmothers (Contributor) left a comment

Need to fix CI

torchtune/data/_utils.py (comment resolved, outdated)
torchtune/data/_utils.py (comment resolved)
torchtune/data/_messages.py (comment resolved, outdated)
torchtune/datasets/multimodal/_llava_instruct.py (comment resolved, outdated)
torchtune/datasets/multimodal/_llava_instruct.py (comment resolved, outdated)
torchtune/datasets/multimodal/_the_cauldron.py (comment resolved, outdated)
torchtune/datasets/multimodal/_the_cauldron.py (comment resolved, outdated)
codecov-commenter commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 88.41463% with 19 lines in your changes missing coverage. Please review.

Project coverage is 72.46%. Comparing base (71be8ad) to head (8079294).
Report is 5 commits behind head on main.

Files with missing lines                              Patch %   Lines
torchtune/datasets/multimodal/_llava_instruct.py      72.97%    10 Missing ⚠️
torchtune/datasets/multimodal/_the_cauldron.py        71.87%    9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1158      +/-   ##
==========================================
+ Coverage   72.25%   72.46%   +0.20%     
==========================================
  Files         274      279       +5     
  Lines       13278    13475     +197     
==========================================
+ Hits         9594     9764     +170     
- Misses       3684     3711      +27     


@ebsmothers (Contributor) left a comment

A few more comments, but no major concerns from me. Stamping to unblock.

Comment on lines +71 to +74
-2: 1,
12: 1,
10: 1,
-1: 1,
Contributor

Minor thing, but why is our dummy tokenizer returning negatives? I thought it was just the length of each word (or am I misunderstanding?).

@RdoubleA (Contributor, Author)

Those are the image token and the EOS id.

torchtune/datasets/multimodal/_llava_instruct.py (comment resolved, outdated)
Comment on lines 186 to 189
if model_transform.max_seq_len is None:
    raise ValueError(
        "PackedDataset requires a max_seq_len to be set on the tokenizer."
    )
Contributor

This part is a bit confusing to me. The model transform has a max_seq_len, but that's really just associated with the tokenizer, right? So do we pass that up from the tokenizer to the model transform (since it's composed of both the tokenizer and the image transform)?

@RdoubleA (Contributor, Author)

Excellent question. @pbontrager and I were considering enforcing some contract at the model transform level, but maybe for now we just pass that up from the tokenizer (a rough sketch of that option is below).
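
For illustration, a minimal sketch of the "pass it up from the tokenizer" option. The class and attribute names are hypothetical, not the actual torchtune API.

```python
class ComposedModelTransform:
    """Hypothetical transform composed of a tokenizer and an image transform."""

    def __init__(self, tokenizer, image_transform):
        self.tokenizer = tokenizer
        self.image_transform = image_transform
        # Surface the tokenizer's max_seq_len on the transform itself so dataset
        # utilities (e.g. the PackedDataset check above) can read
        # model_transform.max_seq_len directly.
        self.max_seq_len = tokenizer.max_seq_len
```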

Comment on lines +139 to +142
model_transform (Transform): model-specific transform class that takes in a sample dict and applies custom
transforms on the keys. It should consist of at minimum two components: text tokenization (called
on the "messages" field) and image transform (called on the "images" field). The keys returned by
the model transform should be aligned with the expected inputs into the model.
Contributor

Sorry to harp on this, but can we have a more explicit example until the Flamingo transform is available? Even if it's just a mock tokenizer and image transform with clearly defined API contracts in a code snippet, I think that will help a lot.
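
Along those lines, a hedged mock of what such a transform could look like: the tokenize_messages call and the returned keys ("tokens", "mask", "encoder_input") are assumptions about the contract, not the final Flamingo transform.

```python
from typing import Any, Mapping


class MockModelTransform:
    """Mock transform composing text tokenization and an image transform."""

    def __init__(self, tokenizer, image_transform, max_seq_len: int = 2048):
        self.tokenizer = tokenizer            # tokenizes a list of Messages
        self.image_transform = image_transform  # transforms a single image
        self.max_seq_len = max_seq_len

    def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
        # Tokenize the "messages" field produced by the dataset's message transform.
        tokens, mask = self.tokenizer.tokenize_messages(sample["messages"])
        # Apply the image transform to each image in the "images" field.
        images = [self.image_transform(img) for img in sample["images"]]
        # The returned keys should line up with the model's expected inputs.
        return {"tokens": tokens, "mask": mask, "encoder_input": {"images": images}}
```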

@@ -463,3 +463,45 @@ def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
updated_messages.append(Message.from_dict(message))

return {"messages": updated_messages}


def validate_messages(
Contributor

If we're moving this here, should we also delete it from data/_utils.py?

Contributor

Also for multimodal, do we need to do any extra validation? E.g. number of images == number of messages with type == 'image'? (Not saying we should do it here btw)

@RdoubleA (Contributor, Author)

Good point... I'm not sure yet where this should happen. Maybe note this as a follow-up once the whole end-to-end flow is established (one possible check is sketched below).
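
One possible shape for that follow-up check, sketched under the assumption that Message.content is a list of dicts with a "type" field; this is not code from this PR.

```python
def validate_image_count(messages, images) -> None:
    """Check that the image placeholders in the messages match the attached images."""
    # Count image-type content items across all messages in the sample.
    num_image_placeholders = sum(
        1
        for message in messages
        for item in message.content
        if item.get("type") == "image"
    )
    if num_image_placeholders != len(images):
        raise ValueError(
            f"Sample has {len(images)} images but {num_image_placeholders} "
            "image placeholders in its messages."
        )
```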

@RdoubleA RdoubleA merged commit cf327a9 into pytorch:main Sep 4, 2024
20 checks passed
@RdoubleA RdoubleA deleted the mm_dataset branch September 4, 2024 17:53
Labels: CLA Signed
4 participants