[Feature request] Support for external modality for language datasets #263
Comments
Thanks a lot, @aleSuglia, for the very detailed and instructive feature request. One of the questions here is that Arrow doesn't have built-in support for generic "tensors" in records, but there might be ways to do that in a clean way. We'll probably try to tackle this during the summer.
I was looking into Facebook MMF and apparently they decided to use LMDB to store the additional features associated with every example: https://github.com/facebookresearch/mmf/blob/master/mmf/datasets/databases/features_database.py
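For reference, here is a minimal sketch of what such an LMDB-based feature store could look like; the file path, keys, and shapes below are purely illustrative and are not MMF's actual schema:

```python
# Sketch of storing per-example features in LMDB, in the spirit of MMF's
# features_database.py. Path, key names and shapes are illustrative only.
import lmdb
import pickle
import numpy as np

env = lmdb.open("features.lmdb", map_size=1 << 30)  # ~1 GB map size

# Write: one record per image id, feature array serialized with pickle
with env.begin(write=True) as txn:
    features = np.random.rand(36, 2048).astype(np.float32)  # e.g. 36 region features
    txn.put(b"image_0001", pickle.dumps(features))

# Read back by id
with env.begin() as txn:
    restored = pickle.loads(txn.get(b"image_0001"))
print(restored.shape)  # (36, 2048)
```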
I saw the Mozilla common_voice dataset in the model hub, which has mp3 audio recordings as part of it. It's used predominantly in ASR and TTS, but the dataset is a Language + Voice dataset, similar to @aleSuglia's point about Language + Vision.
Hey @thomwolf, are there any updates on this? I would love to contribute if possible! Thanks!
Hi @aleSuglia :) In today's new release 1.17 of `datasets`, the Image and Audio feature types are available. Note that both feature types take care of decoding the image/audio data if needed. The returned images are PIL images, and the audio signals are decoded as numpy arrays.
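A small usage sketch of those feature types (the file paths and column names here are placeholders, not files shipped with the library):

```python
# Sketch using the Image and Audio feature types in 🤗 Datasets.
# File paths and column names are placeholders.
from datasets import Dataset, Image, Audio

ds = Dataset.from_dict({"image": ["path/to/img.png"], "audio": ["path/to/clip.mp3"]})

# Casting attaches the decoding logic to the columns
ds = ds.cast_column("image", Image())
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

example = ds[0]
example["image"]                   # decoded as a PIL.Image on access
example["audio"]["array"]          # decoded as a 1-D numpy array
example["audio"]["sampling_rate"]  # resampled to the requested rate
```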
Background
In recent years many researchers have advocated that learning meanings from text-only datasets is just like asking a human to "learn to speak by listening to the radio" [E. Bender and A. Koller, 2020; Y. Bisk et al., 2020]. Multi-modal datasets are therefore of paramount importance for the NLP community and for next-generation models. For this reason, I'm raising a concern about the best way to integrate external features in NLP datasets (e.g., visual features associated with an image, audio features associated with a recording, etc.). This would be of great value for a more systematic way of representing data for ML models that learn from multi-modal data.
Language + Vision
Use case
Typically, people working on Language+Vision tasks have a reference dataset (either in JSON or JSONL format) and, for each example, an identifier that specifies the reference image. For a practical example, you can refer to the GQA dataset.
Currently, images are represented by either pooling-based features (average pooling of ResNet or VGGNet features, see DeVries et al., 2017; Shekhar et al., 2019), where you have a single vector for every image. Another option is to use a set of feature maps for every image, extracted from a specific layer of a CNN (see Xu et al., 2015). A more recent option, especially with large-scale multi-modal transformers (Li et al., 2019), is to use Faster R-CNN region features.
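As a rough sketch, the first two kinds of features are typically obtained along these lines with torchvision; the backbone, preprocessing, and shapes below are just illustrative:

```python
# Sketch of extracting pooled features vs. feature maps from a ResNet backbone.
# Preprocessing is kept minimal for brevity (no normalization, fixed resize).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet101(pretrained=True).eval()
# Everything up to (but excluding) the avg-pool / fc layers -> spatial feature maps
feature_maps = torch.nn.Sequential(*list(resnet.children())[:-2])
# Everything up to (but excluding) the fc layer -> a single pooled vector
pooled = torch.nn.Sequential(*list(resnet.children())[:-1])

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])
img = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    maps = feature_maps(img)       # e.g. (1, 2048, 7, 7) grid of feature maps
    vec = pooled(img).flatten(1)   # e.g. (1, 2048) single vector per image
```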
For all these types of features, people use one of the following formats:
Implementation considerations
I was thinking about possible ways of implementing this feature. As mentioned above, depending on the model, different visual features can be used. This step usually relies on another model (say ResNet-101) that is used to generate the visual features for each image used in the dataset. Typically, this step is done in a separate script that completes the feature generation procedure. The usual processing steps for these datasets are the following:
In my personal projects, I've decided to ignore HDF5 because it doesn't have out-of-the-box support for multiprocessing (see this PyTorch issue). I've been successfully using a compressed NumPy file for each image so that I can store any sort of information in it.
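A minimal sketch of that per-image compressed NumPy approach (the directory layout, array names, and shapes are arbitrary choices for illustration):

```python
# Sketch of the per-image compressed NumPy storage described above.
import numpy as np

# Feature-generation step: one .npz file per image id
image_id = "2375429"   # e.g. a GQA-style image identifier (illustrative)
features = np.random.rand(36, 2048).astype(np.float32)
boxes = np.random.rand(36, 4).astype(np.float32)
np.savez_compressed(f"features/{image_id}.npz", features=features, boxes=boxes)

# Training-time step: each worker process opens files independently,
# which sidesteps the HDF5 multiprocessing issue mentioned above.
with np.load(f"features/{image_id}.npz") as data:
    feats = data["features"]  # (36, 2048)
    boxes = data["boxes"]     # (36, 4)
```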
For ease of use of all these Language+Vision datasets, it would be really handy to have a way to associate the visual features with the text and store them in an efficient way. That's why I immediately thought about the HuggingFace NLP backend based on Apache Arrow. The assumption here is that the external modality would be mapped to an N-dimensional tensor and therefore easily represented by a NumPy array.
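One possible way this could look, assuming the library's fixed-shape `Array2D` feature type is used to back the tensor column (column names, shapes, and values below are illustrative, not a proposed API):

```python
# Sketch of keeping text and visual features together in one Arrow-backed
# dataset via the Array2D feature type. All names and values are illustrative.
import numpy as np
from datasets import Dataset, Features, Value, Array2D

features = Features({
    "question": Value("string"),
    "image_id": Value("string"),
    "visual_features": Array2D(shape=(36, 2048), dtype="float32"),
})

ds = Dataset.from_dict(
    {
        "question": ["Is the cat on the table?"],
        "image_id": ["2375429"],
        "visual_features": [np.random.rand(36, 2048).astype(np.float32)],
    },
    features=features,
)
print(ds[0]["visual_features"][0][:5])  # stored in Arrow, returned as nested lists
```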
Looking forward to hearing your thoughts about it!