[Feature request] Support for external modality for language datasets #263
Comments
Thanks a lot, @aleSuglia, for the very detailed and instructive feature request. One of the questions here is that Arrow doesn't have built-in support for generic "tensors" in records, but there might be ways to do that in a clean way. We'll probably try to tackle this during the summer.
I was looking into Facebook MMF and apparently they decided to use LMDB to store the additional features associated with every example: https://github.com/facebookresearch/mmf/blob/master/mmf/datasets/databases/features_database.py
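For reference, here is a minimal sketch of what such an LMDB-based feature store could look like; the file path, keys, and shapes below are purely illustrative and are not MMF's actual schema:

```python
# Sketch of storing per-example features in LMDB, in the spirit of MMF's
# features_database.py. Path, key names and shapes are illustrative only.
import lmdb
import pickle
import numpy as np

env = lmdb.open("features.lmdb", map_size=1 << 30)  # ~1 GB map size

# Write: one record per image id, feature array serialized with pickle
with env.begin(write=True) as txn:
    features = np.random.rand(36, 2048).astype(np.float32)  # e.g. 36 region features
    txn.put(b"image_0001", pickle.dumps(features))

# Read back by id
with env.begin() as txn:
    restored = pickle.loads(txn.get(b"image_0001"))
print(restored.shape)  # (36, 2048)
```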
I saw the Mozilla common_voice dataset in the model hub, which has mp3 audio recordings as part of it. It's used predominantly in ASR and TTS, but the dataset is a Language + Voice dataset, similar to @aleSuglia's point about Language + Vision.
Hey @thomwolf, are there any updates on this? I would love to contribute if possible! Thanks!
Hi @aleSuglia :) In today's new release 1.17 of `datasets`, the Image and Audio feature types are available. Note that both feature types take care of decoding the image/audio data if needed. The returned images are PIL images, and the audio signals are decoded as numpy arrays.
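A small usage sketch of those feature types (the file paths and column names here are placeholders, not files shipped with the library):

```python
# Sketch using the Image and Audio feature types in 🤗 Datasets.
# File paths and column names are placeholders.
from datasets import Dataset, Image, Audio

ds = Dataset.from_dict({"image": ["path/to/img.png"], "audio": ["path/to/clip.mp3"]})

# Casting attaches the decoding logic to the columns
ds = ds.cast_column("image", Image())
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

example = ds[0]
example["image"]                   # decoded as a PIL.Image on access
example["audio"]["array"]          # decoded as a 1-D numpy array
example["audio"]["sampling_rate"]  # resampled to the requested rate
```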
Background
In recent years many researchers have advocated that learning meanings from text-only datasets is just like asking a human to "learn to speak by listening to the radio" [E. Bender and A. Koller, 2020; Y. Bisk et al., 2020]. Multi-modal datasets are therefore of paramount importance for the NLP community and for next-generation models. For this reason, I'm raising a concern about the best way to integrate external features in NLP datasets (e.g., visual features associated with an image, audio features associated with a recording, etc.). This would be of great value for a more systematic way of representing data for ML models that learn from multi-modal data.
Language + Vision
Use case
Typically, people working on Language+Vision tasks have a reference dataset (either in JSON or JSONL format) and, for each example, an identifier that specifies the reference image. For a practical example, you can refer to the GQA dataset.
Currently, images are represented by either pooling-based features (average pooling of ResNet or VGGNet features, see DeVries et al., 2017; Shekhar et al., 2019), where you have a single vector for every image. Another option is to use a set of feature maps for every image, extracted from a specific layer of a CNN (see Xu et al., 2015). A more recent option, especially with large-scale multi-modal transformers (Li et al., 2019), is to use Faster R-CNN region features.
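As a rough sketch, the first two kinds of features are typically obtained along these lines with torchvision; the backbone, preprocessing, and shapes below are just illustrative:

```python
# Sketch of extracting pooled features vs. feature maps from a ResNet backbone.
# Preprocessing is kept minimal for brevity (no normalization, fixed resize).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet101(pretrained=True).eval()
# Everything up to (but excluding) the avg-pool / fc layers -> spatial feature maps
feature_maps = torch.nn.Sequential(*list(resnet.children())[:-2])
# Everything up to (but excluding) the fc layer -> a single pooled vector
pooled = torch.nn.Sequential(*list(resnet.children())[:-1])

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])
img = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    maps = feature_maps(img)       # e.g. (1, 2048, 7, 7) grid of feature maps
    vec = pooled(img).flatten(1)   # e.g. (1, 2048) single vector per image
```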
For all these types of features, people use one of the following formats:
Implementation considerations
I was thinking about possible ways of implementing this feature. As mentioned above, depending on the model, different visual features can be used. This step usually relies on another model (say ResNet-101) that is used to generate the visual features for each image used in the dataset. Typically, this step is done in a separate script that completes the feature generation procedure. The usual processing steps for these datasets are the following:
In my personal projects, I've decided to ignore HDF5 because it doesn't have out-of-the-box support for multiprocessing (see this PyTorch issue). I've been successfully using a compressed NumPy file for each image so that I can store any sort of information in it.
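A minimal sketch of that per-image compressed NumPy approach (the directory layout, array names, and shapes are arbitrary choices for illustration):

```python
# Sketch of the per-image compressed NumPy storage described above.
import numpy as np

# Feature-generation step: one .npz file per image id
image_id = "2375429"   # e.g. a GQA-style image identifier (illustrative)
features = np.random.rand(36, 2048).astype(np.float32)
boxes = np.random.rand(36, 4).astype(np.float32)
np.savez_compressed(f"features/{image_id}.npz", features=features, boxes=boxes)

# Training-time step: each worker process opens files independently,
# which sidesteps the HDF5 multiprocessing issue mentioned above.
with np.load(f"features/{image_id}.npz") as data:
    feats = data["features"]  # (36, 2048)
    boxes = data["boxes"]     # (36, 4)
```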
For ease of use of all these Language+Vision datasets, it would be really handy to have a way to associate the visual features with the text and store them in an efficient way. That's why I immediately thought about the HuggingFace NLP backend based on Apache Arrow. The assumption here is that the external modality would be mapped to an N-dimensional tensor and therefore easily represented by a NumPy array.
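One possible way this could look, assuming the library's fixed-shape `Array2D` feature type is used to back the tensor column (column names, shapes, and values below are illustrative, not a proposed API):

```python
# Sketch of keeping text and visual features together in one Arrow-backed
# dataset via the Array2D feature type. All names and values are illustrative.
import numpy as np
from datasets import Dataset, Features, Value, Array2D

features = Features({
    "question": Value("string"),
    "image_id": Value("string"),
    "visual_features": Array2D(shape=(36, 2048), dtype="float32"),
})

ds = Dataset.from_dict(
    {
        "question": ["Is the cat on the table?"],
        "image_id": ["2375429"],
        "visual_features": [np.random.rand(36, 2048).astype(np.float32)],
    },
    features=features,
)
print(ds[0]["visual_features"][0][:5])  # stored in Arrow, returned as nested lists
```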
Looking forward to hearing your thoughts about it!