[7/7] Multimodal datasets (The Cauldron, LLaVA-Instruct-150K) #1158
Conversation
🔗 Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1158
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 703b986 with merge base c6693d4. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Need to fix CI
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@ Coverage Diff @@
##             main    #1158      +/-   ##
==========================================
+ Coverage   72.25%   72.46%   +0.20%
==========================================
  Files         274      279       +5
  Lines       13278    13475     +197
==========================================
+ Hits         9594     9764     +170
- Misses       3684     3711      +27

☔ View full report in Codecov by Sentry.
A few more comments but no major concerns from me. Stamping to unblock
-2: 1,
12: 1,
10: 1,
-1: 1,
Minor thing but why is our dummy tokenizer returning negatives? I thought it was just the length of each word (or am I misunderstanding)
These are the image token and the eos id.
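To make the discussion concrete, here is a hypothetical sketch of the dummy tokenizer behavior described in this thread: ordinary words map to their character length, while special tokens (an image placeholder and eos) use negative sentinel ids. The specific ids and the `<image>` marker are assumptions for illustration, not torchtune's actual values.

```python
IMAGE_ID = -2  # assumed sentinel id for the image special token
EOS_ID = -1    # assumed sentinel id for end-of-sequence

def dummy_tokenize(words):
    """Map each word to its length; special tokens get negative ids; append eos."""
    ids = [IMAGE_ID if w == "<image>" else len(w) for w in words]
    ids.append(EOS_ID)
    return ids

print(dummy_tokenize(["<image>", "hello", "world", "hi"]))
# [-2, 5, 5, 2, -1]
```

This explains why negatives show up in the expected-token fixtures above: they are special-token sentinels, not word lengths.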
if model_transform.max_seq_len is None:
    raise ValueError(
        "PackedDataset requires a max_seq_len to be set on the tokenizer."
    )
This part is a bit confusing to me. So model transform has a max_seq_len, but that's really just associated with the tokenizer, right? So do we pass that up from the tokenizer to the model transform (since it's composed of both tokenizer and image transform)?
excellent question, @pbontrager and I were considering enforcing some contract at the model transform level, but maybe for now we just pass that up from the tokenizer
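One way "passing it up from the tokenizer" could look, as a minimal sketch (class and attribute names here are illustrative, not torchtune's real API): the composed transform reads `max_seq_len` off its tokenizer at construction time, so dataset code like `PackedDataset` only ever touches the transform.

```python
class ComposedTransform:
    """Sketch: a model transform composed of a tokenizer and an image
    transform that surfaces the tokenizer's max_seq_len at its own level."""

    def __init__(self, tokenizer, image_transform):
        self.tokenizer = tokenizer
        self.image_transform = image_transform
        # Pass the sequence-length contract up from the tokenizer so callers
        # can check model_transform.max_seq_len directly.
        self.max_seq_len = getattr(tokenizer, "max_seq_len", None)


class _DummyTokenizer:
    max_seq_len = 2048


transform = ComposedTransform(_DummyTokenizer(), image_transform=None)
print(transform.max_seq_len)  # 2048
```

Enforcing a stricter contract at the model-transform level (e.g. an abstract property) could come later; this just forwards the attribute.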
model_transform (Transform): model-specific transform class that takes in a sample dict and applies custom
    transforms on the keys. It should consist of at minimum two components: text tokenization (called
    on the "messages" field) and image transform (called on the "images" field). The keys returned by
    the model transform should be aligned with the expected inputs into the model.
Sorry to harp on this, but can we have a more explicit example until the Flamingo transform is available? Even if it's just a mock tokenizer and image transform with clearly-defined API contracts in a code snippet, I think that will help a lot
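Pending the real Flamingo transform, here is a hedged mock of the contract the docstring describes: tokenization applied to the "messages" field, an image transform applied to the "images" field, and output keys aligned with the model's inputs. All class names, methods, and output keys below ("tokens", "encoder_input") are assumptions for illustration, not torchtune's actual API.

```python
from typing import Any, Mapping


class MockTokenizer:
    def tokenize_messages(self, messages):
        # Toy scheme: one token per word (its length), plus an assumed eos id 0.
        tokens = [len(w) for m in messages for w in m.split()]
        return tokens + [0]


class MockImageTransform:
    def __call__(self, image):
        # Stand-in for resize/normalize/patchify; passes the image through.
        return image


class MockModelTransform:
    def __init__(self, tokenizer, image_transform):
        self.tokenizer = tokenizer
        self.image_transform = image_transform

    def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
        tokens = self.tokenizer.tokenize_messages(sample["messages"])
        images = [self.image_transform(img) for img in sample.get("images", [])]
        # The returned keys must line up with what the model's forward expects.
        return {"tokens": tokens, "encoder_input": images}


transform = MockModelTransform(MockTokenizer(), MockImageTransform())
out = transform({"messages": ["hello world"], "images": ["<pil image>"]})
print(out["tokens"])  # [5, 5, 0]
```

The important part is the shape of the contract, not the toy internals: a sample dict in, a model-input dict out, with the two component transforms each owning one field.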
@@ -463,3 +463,45 @@ def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
        updated_messages.append(Message.from_dict(message))

        return {"messages": updated_messages}


def validate_messages(
If we're moving this here, should also delete from data/_utils.py?
Also for multimodal, do we need to do any extra validation? E.g. number of images == number of messages with type == 'image'? (Not saying we should do it here btw)
good point... I am not sure yet where this should happen. maybe note this as a followup once the whole end-to-end is established
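For the follow-up, the extra multimodal check suggested above could look like this sketch: count image-typed content items across the messages and compare against the number of images provided. The message structure (a "content" list of typed items) is an assumption for illustration; the real check would use torchtune's `Message` API wherever this validation ends up living.

```python
def validate_image_counts(messages, images):
    """Hypothetical check: every image in the sample should have a matching
    image-type content item in the messages, and vice versa."""
    n_image_tags = sum(
        1
        for msg in messages
        for item in msg.get("content", [])
        if item.get("type") == "image"
    )
    if n_image_tags != len(images):
        raise ValueError(
            f"Found {n_image_tags} image placeholders in messages "
            f"but {len(images)} images were provided."
        )
```

Whether this belongs in `validate_messages` or in the dataset/transform layer is exactly the open question; the sketch only pins down what would be checked.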
Changelog
- ShareGPTToMessages with support for an image column

Test plan