
[Feature request] Support for external modality for language datasets #263

Closed
aleSuglia opened this issue Jun 11, 2020 · 5 comments
Labels: enhancement (New feature or request), generic discussion (Generic discussion on the library)


@aleSuglia
Contributor

aleSuglia commented Jun 11, 2020

Background

In recent years, many researchers have argued that learning meanings from text-only datasets is like asking a human to "learn to speak by listening to the radio" [E. Bender and A. Koller, 2020; Y. Bisk et al., 2020]. Multi-modal datasets are therefore of paramount importance for next-generation NLP models. For this reason, I'm raising the question of how best to integrate external features into NLP datasets (e.g., visual features associated with an image, audio features associated with a recording, etc.). This would enable a more systematic way of representing data for ML models that learn from multi-modal inputs.

Language + Vision

Use case

Typically, people working on Language+Vision tasks have a reference dataset (in JSON or JSONL format) in which each example carries an identifier that specifies the reference image. For a practical example, see the GQA dataset.
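For illustration, a single record in such a reference file might look like the sketch below. The field names are hypothetical (not GQA's exact schema); the only requirement is that each language example carries the identifier of its reference image.

```python
# One JSONL-style record pairing a language example with its reference image id.
# Field names are illustrative, not taken from any specific dataset.
example = {
    "question_id": "q_000001",
    "question": "What color is the cat on the sofa?",
    "image_id": "img_2370799",
    "answer": "black",
}
```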

Currently, images are commonly represented by pooling-based features (average pooling of ResNet or VGGNet activations, see DeVries et al., 2017; Shekhar et al., 2019), where a single vector represents every image. Another option is to use the feature maps extracted from a specific layer of a CNN for every image (see Xu et al., 2015). A more recent option, especially with large-scale multi-modal transformers (Li et al., 2019), is to use Faster R-CNN region features.

For all these types of features, people use one of the following formats:

  1. HDF5
  2. NumPy
  3. LMDB
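As a rough sketch of the first and third options (file names, keys, and the feature vector below are placeholders), pooled features could be written to a single HDF5 file keyed by image id, or serialized into an LMDB key/value store:

```python
import io

import h5py
import lmdb
import numpy as np

image_id = "img_2370799"                             # placeholder identifier
features = np.random.rand(2048).astype("float32")    # stand-in for real pooled features

# Option 1: one HDF5 dataset per image id, all in a single file.
with h5py.File("visual_features.h5", "w") as h5_file:
    h5_file.create_dataset(image_id, data=features)

# Option 3: serialize each feature array into an LMDB environment.
env = lmdb.open("visual_features.lmdb", map_size=1 << 30)  # 1 GiB map size
with env.begin(write=True) as txn:
    buffer = io.BytesIO()
    np.save(buffer, features)
    txn.put(image_id.encode("utf-8"), buffer.getvalue())
```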

Implementation considerations

I was thinking about possible ways of implementing this feature. As mentioned above, different visual features can be used depending on the model. Generating them usually relies on another model (say, ResNet-101) that produces the visual features for each image in the dataset, typically in a separate script that runs the whole feature-extraction procedure. The usual processing steps for these datasets are the following:

  1. Download dataset
  2. Download images associated with the dataset
  3. Write a script that generates the visual features for every image and stores them in a specific file
  4. Create a DataLoader that maps the visual features to the corresponding language example
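A minimal sketch of step 4, assuming the reference dataset is a JSONL file and the visual features were saved as one compressed NumPy file per image, as described below (file paths, field names, and the "pooled" key are placeholders):

```python
import json

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class LanguageVisionDataset(Dataset):
    """Maps each language example to its precomputed visual features."""

    def __init__(self, jsonl_path, features_dir):
        with open(jsonl_path) as f:
            self.examples = [json.loads(line) for line in f]
        self.features_dir = features_dir

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        example = self.examples[index]
        # Load the precomputed features stored for this example's reference image.
        features = np.load(f"{self.features_dir}/{example['image_id']}.npz")["pooled"]
        return example["question"], torch.from_numpy(features)


loader = DataLoader(LanguageVisionDataset("train.jsonl", "features"), batch_size=32)
```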

In my personal projects, I've decided to avoid HDF5 because it doesn't have out-of-the-box support for multi-processing (see this PyTorch issue). I've been successfully using a compressed NumPy file for each image, so that I can store any sort of information in it.
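A minimal sketch of that per-image approach, assuming a ResNet-101 backbone from torchvision as the feature extractor (the paths and the "pooled" key are placeholders):

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

# Pooled-feature extractor: ResNet-101 with its classification head removed.
encoder = torch.nn.Sequential(*list(models.resnet101(pretrained=True).children())[:-1]).eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("images/img_2370799.jpg").convert("RGB")
with torch.no_grad():
    pooled = encoder(preprocess(image).unsqueeze(0)).squeeze().numpy()  # shape: (2048,)

# One compressed .npz per image; extra arrays (boxes, masks, ...) can be added as keys.
np.savez_compressed("features/img_2370799.npz", pooled=pooled)
```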

To make all these Language+Vision datasets easy to use, it would be really handy to have a way to associate the visual features with the text and store them efficiently. That's why I immediately thought about the HuggingFace NLP backend based on Apache Arrow. The assumption here is that the external modality will be mapped to an N-dimensional tensor, and thus easily represented by a NumPy array.

Looking forward to hearing your thoughts about it!

@lhoestq added the enhancement and generic discussion labels on Jun 12, 2020
@thomwolf
Member

Thanks a lot, @aleSuglia, for the very detailed and instructive feature request.
It seems like we could build something pretty useful here indeed.

One open question is that Arrow doesn't have built-in support for generic "tensors" in records, but there might be ways to do this cleanly. We'll probably try to tackle it during the summer.

@aleSuglia
Contributor Author

I was looking into Facebook MMF, and apparently they decided to use LMDB to store the additional features associated with every example: https://github.com/facebookresearch/mmf/blob/master/mmf/datasets/databases/features_database.py

@Abhishek-P

Abhishek-P commented Apr 1, 2021

I saw the Mozilla common_voice dataset on the Hugging Face Hub, which has MP3 audio recordings as part of it. It's used predominantly in ASR and TTS, but it is a Language + Voice dataset, similar to @aleSuglia's point about Language + Vision.

https://huggingface.co/datasets/common_voice

@aleSuglia
Contributor Author

Hey @thomwolf, are there any updates on this? I would love to contribute if possible!

Thanks,
Alessandro

@lhoestq
Member

lhoestq commented Dec 21, 2021

Hi @aleSuglia :) In today's new release 1.17 of datasets we introduce a new feature type Image that allows you to store images directly in a dataset, next to text features and labels for example. There is also an Audio feature type for datasets containing audio data. For tensors, there are Array2D, Array3D, etc. feature types.

Note that both Image and Audio feature types take care of decoding the images/audio data if needed. The returned images are PIL images, and the audio signals are decoded as numpy arrays.

And datasets also leverages end-to-end zero-copy reads from the Arrow data for all of them, for maximum speed :)
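A small sketch of how these feature types fit the original use case, assuming datasets>=1.17 (the image path, shapes, and values below are placeholders):

```python
import numpy as np
from datasets import Array2D, Dataset, Features, Image, Value

features = Features({
    "question": Value("string"),
    "image": Image(),  # decoded to a PIL image when accessed
    "region_features": Array2D(shape=(36, 2048), dtype="float32"),
})

ds = Dataset.from_dict(
    {
        "question": ["What color is the cat on the sofa?"],
        "image": ["images/img_2370799.jpg"],  # placeholder path to a local image file
        "region_features": [np.zeros((36, 2048), dtype="float32")],
    },
    features=features,
)

print(ds[0]["image"])            # PIL.Image.Image
print(ds[0]["region_features"])  # nested lists backed by Arrow
```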
