[feat] Add FLAVA model #16654
Conversation
eb4e9d5 to c1489f5 (compare)
The documentation is not available anymore as the PR was closed or merged.
8ef39ad to 5c2db5d (compare)
Main comment is a little bit below. Sorry, I've clicked review by mistake 😬.
README.md
Outdated
@@ -265,6 +265,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc_flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc_flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
1. **[FLAVA](https://huggingface.co/docs/transformers/main/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
Typo + the model will only be in the main branch until next release, so the link needs to point to the main version of the doc.
docs/source/en/model_doc/flava.mdx
Outdated
@@ -0,0 +1,105 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
docs/source/en/model_doc/flava.mdx
Outdated
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.
*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.*
docs/source/en/model_doc/flava.mdx
Outdated
impressive performance on a wide range of 35 tasks spanning these target modalities.

<!-- Tips: -->
Needs to be filled.
docs/source/en/model_doc/flava.mdx
Outdated
<!-- Tips: -->

This model was contributed by [aps](https://huggingface.co/aps).
<!-- The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).-->
Needs to be filled.
docs/source/en/model_doc/flava.mdx
Outdated
## FLAVAConfig

[[autodoc]] FLAVAConfig
- from_configs
Not sure it's necessary here. Will be automatically added.
- from_configs
docs/source/en/model_doc/flava.mdx
Outdated
<!-- The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).-->

## FLAVAConfig
Will put in the general comment as well, but all names need to be re-cased to FlavaXxx. We don't use BERTConfig, BERTModel etc., but BertConfig, BertModel...
Re this topic: there are multiple examples in the codebase that actually don't follow this convention: XLM, WavLM, ResNet, GPT2, CLIP, CTRL, and many others. Overall, the choice of case seems random and not really enforced everywhere. I am not sure why these models have this privilege and FLAVA can't?
ResNet and WavLM are not all caps, so I don't really see your point there. If FLAVA stands for two different words, we can capitalize the second one, but I don't think that's the case.
This is not a privilege; this is from the beginnings of Transformers. We then got user feedback that those casings made it difficult for them to find the model classes, so we stopped.
src/transformers/__init__.py
Outdated
_import_structure["models.flava"].append("FLAVAFeatureExtractor")
_import_structure["models.flava"].append("FLAVACodebookFeatureExtractor")
_import_structure["models.flava"].append("FLAVAProcessor")
All in one line with an extend
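A minimal sketch of what that could look like, using the class names from the diff above (the _import_structure dict is defined earlier in the real src/transformers/__init__.py; it is stubbed here only to keep the snippet self-contained):

```python
_import_structure = {"models.flava": []}  # already defined earlier in the actual __init__.py

# Replace the three append() calls above with a single extend():
_import_structure["models.flava"].extend(["FLAVAFeatureExtractor", "FLAVACodebookFeatureExtractor", "FLAVAProcessor"])
```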
# (
#     "flava",
#     (
#         "CLIPTokenizer",
#         "CLIPTokenizerFast" if is_tokenizers_available() else None,
#     ),
# ),
Should be cleaned up or fixed. Looking at the processor file, it seems to be using the BertTokenizer?
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
Thanks for your PR, exciting to see this model arrive soon in the library!
One thing is that all names need to be re-cased to FlavaXxx. We don't use BERTConfig, BERTModel etc., but BertConfig, BertModel... Likewise we should have FlavaModel, FlavaConfig etc.
You use Meta or Facebook in several places. You should pick one and stick to it.
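For illustration, the requested casing follows the pattern already used by existing models; the Flava-prefixed names below are what the classes in this PR would become, not existing symbols at the time of this review:

```python
# Existing convention: only the first letter of the model prefix is capitalized.
from transformers import BertConfig, BertModel  # not BERTConfig / BERTModel

# Applied to this PR (illustrative renames):
# FLAVAConfig         -> FlavaConfig
# FLAVAModel          -> FlavaModel
# FLAVAForPreTraining -> FlavaForPreTraining
```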
@@ -0,0 +1,611 @@
# coding=utf-8
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved. |
# Copyright 2021 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" FLAVA model configuration""" |
""" FLAVA model configuration"""
""" FLAVA model configurations"""
logger = logging.get_logger(__name__)

FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json", |
"flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",
"aps/flava-full": "https://huggingface.co/aps/flava-full/resolve/main/config.json",
The namespace also needs to be in the key.
This is the configuration class to store the configuration of a [`FLAVAImageModel`]. It is used to instantiate an
FLAVA model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the FLAVA
[full](https://huggingface.co/aps/flava-full) architecture. |
[full](https://huggingface.co/aps/flava-full) architecture.
[aps/flava-full](https://huggingface.co/aps/flava-full) architecture.
I would leave this as it is to avoid ambiguity in the meaning of the sentence itself.
No, it is ambiguous as it doesn't show the real name of the checkpoint.
image_size (`int`, *optional*, defaults to `224`):
    The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to `16`):
    The size (resolution) of each patch.
num_channels (`int`, *optional*, defaults to `3`): |
Numbers for default values should not be between backticks.
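Concretely, the fragment above would then read (only the entries fully shown in the diff are repeated here):

```
image_size (`int`, *optional*, defaults to 224):
    The size (resolution) of each image.
patch_size (`int`, *optional*, defaults to 16):
    The size (resolution) of each patch.
```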
tests/flava/test_modeling_flava.py
Outdated
@@ -0,0 +1,1234 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved. |
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
@@ -0,0 +1,416 @@
# coding=utf-8
# Copyright 2021 HuggingFace Inc. |
# Copyright 2021 HuggingFace Inc.
# Copyright 2022 HuggingFace Inc.
tests/flava/test_modeling_flava.py
Outdated
def __init__(
    self,
    parent,
    batch_size=12,
    image_size=112,
    num_channels=3,
): |
Can all fit in one line.
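For example, the constructor signature from the diff above could simply be written as:

```python
def __init__(self, parent, batch_size=12, image_size=112, num_channels=3):
    ...
```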
tests/flava/test_modeling_flava.py
Outdated
def _test_model(
    self,
    config,
    inputs,
    test_image=False,
    test_text=False,
): |
Can all fit in one line.
tests/flava/test_processor_flava.py
Outdated
vocab_tokens = [
    "[UNK]",
    "[CLS]",
    "[SEP]",
    "[PAD]",
    "[MASK]",
    "want",
    "##want",
    "##ed",
    "wa",
    "un",
    "runn",
    "##ing",
    ",",
    "low",
    "lowest",
] |
Should be all in one line, and deactivate black for it to make it more readable (# fmt: off before and # fmt: on after).
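That is, something along these lines (using the same tokens as the diff above):

```python
# fmt: off
vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "want", "##want", "##ed", "wa", "un", "runn", "##ing", ",", "low", "lowest"]
# fmt: on
```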
@sgugger Thanks for your review on the PR. About lowercasing things, I followed what I saw in the repo, as not all models were lower case. An example of this would be CLIP, where classes are named as CLIPModel, CLIPConfig, etc.
tests/flava/test_modeling_flava.py
Outdated
@require_torch
class FLAVAForPreTrainingIntegrationTest(unittest.TestCase):
@slow
def test_inference(self): |
cool test!
tests/flava/test_modeling_flava.py
Outdated
model_name = "aps/flava-full"
codebook_name = "aps/flava-codebook"
model = FLAVAForPreTraining.from_pretrained(model_name).to(torch_device)
codebook = FLAVACodebook.from_pretrained(codebook_name).to(torch_device) |
Hmm, don't really like that flava-full and flava-codebook are in two different repositories. All of our models usually work with one and only one modeling file. Can't we merge the FlavaCodebook model into the FlavaForPreTraining class? Don't like that inference is dependent on two different repos.
I would ideally like to avoid that because the codebook is somewhat independent of FLAVA itself, as it is mostly just a custom DALLE encoder. A different codebook can potentially be used with FLAVA. If I join them into one, I will lose that benefit. wdyt?
I see, I guess we have kind of a composition model problem here (maybe similar to Blenderbot2 or RAG, cc @sgugger).
My opinion here is the following. The paper only used the DALLE encoder (no?) and that's what we expect people to use when reproducing the paper. In our experience, few people expect things that have not been done in the paper to work out of the box.
The way it's currently implemented doesn't really allow the user to use another encoder out-of-the-box either. E.g. the user would have to code up a new encoder class which respects the logic of:
mim_labels = codebook.get_codebook_indices(**codebook_fe([image, image], return_tensors="pt").to(torch_device))
i.e. the model would also require a get_codebook_indices(...) function etc. This already requires quite some knowledge about the model, in which case it's probably as easy to fork modeling_flava.py and replace self.codebook = FlavaCodebook inside the ForPretrainingClass with a different encoder (might even be easier to understand).
So my opinion here is that we should put FlavaCodebook inside the modeling classes that need it, so that we only have a single pytorch_model.bin file, which makes it easier to reproduce / reuse the default Flava as described in the paper.
In general, "self-containment" is more important than "flexibility of modularization" for Transformers, in my opinion.
If we had multiple pretrained checkpoints for different codebook classes, we could have added a codebook_type=dalle flag to the config that would have switched in the correct codebook class, but nevertheless I think we should put it inside the FlavaForPreTraining class.
Let's see what @sgugger thinks here.
> The way it's currently implemented doesn't really allow the user to be able to use another encoder out-of-the-box either. E.g. the user would have to code up a new encoder class which respects the logic of:

Not really, because that is just one interface for how to create mim_labels. What FLAVAPretrainedModel is concerned about is mim_labels, not where they came from or how they were created. I would ideally like to avoid creating mim_labels inside the pretraining class if possible.
tests/flava/test_modeling_flava.py
Outdated
@require_vision
@require_torch
class FLAVAModelIntegrationTest(unittest.TestCase): |
Shouldn't Flava also work for text classification (GLUE) or image classification? Can't we add those heads here as well?
And the corresponding classes?
They should! They are planned for a separate PR. I am trying to unblock some use cases by first adding the pretraining classes.
Ok that's fine by me!
""",
FLAVA_START_DOCSTRING.format(config="FLAVACodebookConfig"),
)
class FLAVACodebook(FLAVAPreTrainedModel): |
Just to understand this a bit better - would this model class ever be used as a stand-alone model or in inference? Or does it purely exist for pretraining?
If this model is only used for pretraining (MIM), I'd prefer to not make it its own class, but to just add it as a submodule of ...ForPreTraining. It's quite important that we only have one modeling file IMO and don't create dependencies between different model repos that can get out of sync.
The way I currently understand it is that this model class should never be used as a standalone model.
There are multiple reasons:
- The codebook generates mim labels only, which are not needed for running inference on the model. It would unnecessarily bloat the model if we added the codebook to the pretrained model even when the user doesn't need it. As others have discussed, logic for generating mim_labels should be independent of the model itself.
- Any other codebook can be used with FLAVA, not just this one. Keeping it separate helps with that.
- FLAVACodebook can also be used independently of FLAVA for other projects. It is very similar to DALLE's Encoder.
Ok, but then let's just add the codebook to ForPreTraining but not to Model; then it would not bloat inference, no?
Same for ForImageClassification - no need to add it there, but I think it's better to add it to ForPretraining so that a single forward pass of ForPretraining is enough to get a loss.
A lot of tasks such as WinoGround, zero-shot classification, and ITM retrieval use ForPretraining out-of-the-box, and that's what I meant by inference. I would really like to avoid generating MIM labels unnecessarily inside the model as they are not even needed for inference. This is in general the case for targets, no? Models are provided targets, they don't generate them, unless it is a teacher-student model, but that is not the case here. Wdyt?
Ok, I might not have been very clear here.
For 2) the codebook should not generate the labels, it should process the image into the labels (the codebook is an encoder if I understand correctly), which means that by passing a mask that states which part of the codebook output is masked, plus pixel_values or whatever will be processed by the codebook, we get the labels. It's not like the codebook model generates labels out of nothing, no?
That's what is done for Wav2Vec2
For me, having a single pytorch file / single model repo is important, which is what I'm after here.
I don't think we have any architectures that require multiple model repos / model directories for inference, fine-tuning, or pretraining (no?). I'd like to keep it that way. RAG and CLIP are both composite models, but they are self-contained in a single model repo.
cc @patil-suraj @LysandreJik as well here
Copy-pasting some offline discussion that I had with Patrick, for completeness:
P:
Yeah I think we're more or less on the same page. I also think we should try to avoid unnecessarily using memory, and labels should not be created inside the model. But if the creation of labels depends on pretrained weights (which seems to be the case here and which is also the case for Wav2Vec2), IMO we should add the weights to the model architecture and pass something like mim_label_mask_indices that will be the indices used to mask the output of the codebook.
This has the following advantages:
- Label creation is deterministic because it all depends on mim_label_mask_indices
- Single pytorch file for training (that's very important for us). Otherwise people don't have a self-contained model folder that contains all the weights when doing model.save_pretrained("...") and it creates difficult dependencies between repos
- We don't really try to have a modularized architecture style, rather a self-contained style
A:
I think that is bool_masked_pos.
What you are saying is correct. The codebook generates mim_labels, but they are masked using bool_masked_pos to understand which ones we need to predict.
P:
Cool, couldn't we have bool_masked_pos be an input to ForPretraining? Just like in Wav2Vec2?
A:
We have it as of now; mim_labels are masked using this later on (before the cross-entropy calculation).
For the moment, I have been using the same pretrained weights for FlavaModel and FlavaForPretraining. Using the FlavaForPretraining weights to load the FlavaModel will throw a lot of unexpected-keys errors, but that is fine I guess?
P:
This is expected - that happens every time I call Wav2Vec2Model.from_pretrained(...).
It also happens every time you do BertModel.from_pretrained("bert-base-cased").
A:
Sounds good. I think I am okay with this. I will also add an option add_codebook which is true by default then?
I would also allow the user to pass mim_labels to the forward. If they do, we use those; otherwise we use the codebook to generate the labels and mask them.
The only thing that is left is how we generate mlm_labels more elegantly.
The DataCollator doesn't work for me as it just updates input_ids, which we need as they are. Maybe a new data collator, I am not sure. There should be something simpler.
P:
That's what I do for Wav2Vec2: def _sample_negative_indices(
Then this, for pretraining: transformers/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py, line 326 in 67ed0e4:
sampled_negative_indices = _sample_negative_indices(
A:
I also remembered that the feature extractor will now also need to output a separate processed image for the codebook, and the forward needs to take it as a parameter. That was one more reason to keep them separate.
P:
Ah interesting, so you need 1 tokenizer and 2 feature extractors for FLAVA?
IMO, the best approach would then be to add a return_pixel_values_for_codebook parameter to the feature extractor.
Actually it'd be good if we discussed this publicly on GitHub so that Sylvain, Lysandre and Suraj could also take a look and give their opinion 🙂
so after that we are back to GitHub.
Would be good to have your opinions @sgugger @LysandreJik @patil-suraj
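For readers following along, here is a hedged sketch of the label flow being discussed (the helper name, the -100 ignore index, and the exact shapes are illustrative assumptions, not the final API of this PR):

```python
def build_mim_labels(codebook, codebook_pixel_values, bool_masked_pos, ignore_index=-100):
    # 1. The codebook (a DALL-E-style image encoder) turns the image into
    #    discrete token indices, one per patch.
    mim_labels = codebook.get_codebook_indices(codebook_pixel_values)  # (batch, num_patches)
    # 2. Only positions flagged by bool_masked_pos contribute to the MIM loss;
    #    every other position is set to the ignore index before the cross-entropy.
    mim_labels = mim_labels.clone()
    mim_labels[~bool_masked_pos.bool()] = ignore_index
    return mim_labels
```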
I agree that everything should be in one repo; we never use two separate repos for one model, so that does imply the codebook should be part of the model that uses it, to be in the same checkpoint (and if the same codebook weights can be reused for different FLAVA models, that's completely supported as long as the same codebook weights are in all the checkpoints).
The only alternatives (while keeping a single repo) would be to have the weights of the codebook under some ad-hoc prefix corresponding to the base prefix of the codebook and different from the base prefix of the other FLAVA models, so rather an ugly hack; or to use a different branch for the codebook, which would result in codebook = Xxx.from_pretrained(checkpoint, revision="codebook").
Model looks super cool! Very excited to try it out!
My main remarks are:
- Why don't we add any heads for image classification or text classification? FLAVA got pretty good results on those, no?
- Not a fan of loading checkpoints from two different repos when training / inference. Think FlavaCodebook should be refactored as a submodule to all classes that actually require it (just pretraining?)
Hi! I was wondering if there are any updates on this and if this will be merged soon?
814ed8f to 806a056 (compare)
This PR aims to add the [FLAVA](https://arxiv.org/abs/2112.04482) model to the transformers repo. The following checklist delineates the list of things to be done for this PR to be complete:
- [x] Flava init
- [x] Flava base models
- [x] Flava layers
- [x] Flava Configs
- [x] Flava encoders
- [x] Flava pretraining models
- [ ] Flava classification/retrieval models (in progress)
- [x] Documentation updates (in progress)
- [x] Imports updates (in progress)
- [x] Argstring updates
- [x] Flava pretrained checkpoints (in progress)
- [x] Flava tests
- [x] Flava processors (in progress)
- [x] Sanity check
- [x] Lint
806a056 to 29953bb (compare)
@kshitijkg This is almost ready and should land in the core soon. @sgugger @patrickvonplaten This is ready for another round of review. I have resolved all of your comments except a few where I have asked for clarification. Hoping to land it soon.
Thanks a lot for addressing all comments. I'll let @patrickvonplaten comment on the part he felt strongly about, for me all looks good apart from the two preprocessing functions inside the modeling file.
docs/source/en/model_doc/flava.mdx
Outdated
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model
](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022. |
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model
](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
return tuple(self[k] if k not in transformer_outputs else getattr(self, k).to_tuple() for k in self.keys())

def add_masked_text( |
This and add_whole_word_masked_text are not linked to the model per se, so they should go somewhere else (probably in an example script showing how to use FLAVA). It's as if the word masking code was in the BERT file (which it's not).
Agree
That's only relevant for pretraining no?
So, I am somewhat conflicted about these. I think Flava is incomplete without these, and I am also using them in the tests to verify that the pretraining loss calculation logic is working as expected. These are actually similar in spirit to the processing functions in modeling_wav2vec2 (e.g. def _compute_mask_indices). I can move them to processing_flava so as not to pollute the modeling code, if that helps. What do you think?
Note that while _compute_mask_indices is also used for pretraining in Wav2Vec2, its main purpose is to apply SpecAugment during fine-tuning (it's called within the model), which is similar in spirit to dropout. Also, no tokenizer logic is used in _compute_mask_indices; it's therefore not dependent on any settings of the tokenizer, but just the model.
I would still think it's better to not put them there, mainly because we want Flava to have a similar design as BERT, no? BERT's mlm code can be found in the examples.
The main points for me are:
- add_masked_text to me seems to only be used in pretraining
- it is not called within the model, but prepares inputs and labels
- it depends on the tokenizer
- BERT's mlm is in the examples
=> because of those points, I'd favor not putting it in the modeling file
The logic for mlm is inside a data collator, and these days people usually want to probe the pretrained model for their pretraining objectives as well (I actually want to probe it too), so that's why I believe having this as a separate function would make sense (even if only in a processing_flava file). Another question: how would I do the integration test for users without these functions, hardcode masked indices?
I have removed these for now. I will look into some other alternative to provide them in an easier way later.
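For reference, the BERT-style pattern the reviewers point to keeps MLM label creation in a collator or example script rather than in the modeling file. A minimal sketch of that pattern (the checkpoint name and masking probability are placeholders; note that it replaces the masked tokens in input_ids, which was the author's concern above):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# The collator masks tokens and builds the labels outside of the model.
batch = collator([tokenizer("a photo of a cat")])
# batch["input_ids"] now contains [MASK]ed tokens and batch["labels"] the MLM targets,
# which could then be fed to the pretraining model as mlm_labels.
```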
)
class FLAVAImageModel(FLAVAPreTrainedModel):
    config_class = FLAVAImageConfig
    base_model_prefix = "flava.image_model"
Ah, I understand what you're trying to do. This is a current limitation of the PreTrainedModel API, as base_model_prefix can only be one value, not several depending on the model. So this is indeed the best way to achieve this.
"Image.Image", # noqa
np.ndarray,
"torch.Tensor", # noqa
List["Image.Image"], # noqa
List[np.ndarray],
List["torch.Tensor"], # noqa |
I hate those kinds of type annotations anyway (not sure who they're actually helping, honestly, when they're this long), so do as you prefer :-)
@slow
def test_inference(self):
model_name = "facebook/flava-full"
model = FlavaModel.from_pretrained(model_name).to(torch_device) |
Thanks for putting everything in one model!
embeddings = self.dropout(embeddings)
return embeddings
Can we add a # Copied from ... BertSelfAttention here? Aren't they exactly the same?
Ok, BERT allows one to also pass encoder hidden states in case it's used as a decoder cross-attention layer. I think we could have the same logic here. But it's also ok for me to not add it if you think it doesn't fit Flava.
Thanks for the comment. I think that is not useful for FLAVA so I would avoid it for now.
return outputs
class FlavaIntermediate(nn.Module): |
Add a # Copied from ... BERT?
Added.
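For context, such an annotation in Transformers typically looks like the following; the exact source class used for FlavaIntermediate is illustrative here, not taken from this PR:

```python
import torch.nn as nn


# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->Flava
class FlavaIntermediate(nn.Module):
    ...
```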
Very cool! PR is also good to go for me. Agree with @sgugger that preprocessing logic for pretraining should not go into the model file, but should just live in an example (pretraining_flava.py under examples). For now maybe we'll just remove it and keep it alive in a PR in case we want to add an example?
Also, we could add more # Copied from statements from BERT, which would make the code more robust and easier to maintain, but I understand if you don't want this since it would add encoder_attention_mask / encoder_hidden_states logic which would only be useful when using Flava as a decoder => so also ok for me to not do it.
* [WIP] Add FLAVA model This PR aims to add [FLAVA](https://arxiv.org/abs/2112.04482) model to the transformers repo. Following checklist delineates the list of things to be done for this PR to be complete: [x] Flava init [x] Flava base models [x] Flava layers [x] Flava Configs [x] Flava encoders [x] Flava pretraining models [ ] Flava classification/retrieval models (To be added in a separate PR) [x] Documentation updates [x] Imports updates [x] Argstring updates [x] Flava pretrained checkpoints [x] Flava tests [x] Flava processors [x] Sanity check [x] Lint
This PR aims to add the FLAVA model to the transformers repo.
The following checklist delineates the list of things to be done for this PR to be complete: