Multidimensional arrays in a Dataset #2080

vermouthmjl · 2021-03-18T16:29:14Z

Hi,

I'm trying to put together a datasets.Dataset to use with LayoutLM, which is available in transformers. This model requires the bounding box of each token in a sequence as input. That is when I realized that Dataset does not seem to support multi-dimensional arrays as the value of a column in a row.

The following code results in a conversion error in pyarrow (pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column bbox with type object')):

from datasets import Dataset
import pandas as pd
import numpy as np

dataset = pd.DataFrame({
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
})
dataset = Dataset.from_pandas(dataset)

Since I want to use PyTorch for the downstream training task, I also tried a few ways to put a column of 2-D PyTorch tensors directly into a formatted dataset, but I can only get a list of 1-D tensors, a list of arrays, or a list of lists.

import torch
from datasets import Dataset
import pandas as pd

dataset = pd.DataFrame({
    'bbox': [
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]],
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]],
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]],
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]]
    ],
    'input_ids': [1, 2, 3, 4]
})
dataset = Dataset.from_pandas(dataset)

def test(examples):
    return {'bbbox': torch.Tensor(examples['bbox'])}
dataset = dataset.map(test)
print(dataset[0]['bbox'])
print(dataset[0]['bbbox'])

dataset.set_format(type='torch', columns=['input_ids', 'bbox'], output_all_columns=True)
print(dataset[0]['bbox'])
print(dataset[0]['bbbox'])

def test2(examples):
    return {'bbbox': torch.stack(examples['bbox'])}
dataset = dataset.map(test2)

print(dataset[0]['bbox'])
print(dataset[0]['bbbox'])

Is it possible to support n-D arrays/tensors in datasets?
It seems that it can also be useful for this feature request.


lhoestq · 2021-03-19T19:34:57Z

Hi!

This is actually supported, but not yet in from_pandas.
For now, you can use from_dict instead:

from datasets import Dataset
import numpy as np

dataset = {
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
}
dataset = Dataset.from_dict(dataset)

This will work, but to use it with the torch formatter you must specify the Array2D feature type to declare the shape:

from datasets import Dataset, Array2D, Features, Value
import numpy as np

dataset = {
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
}
dataset = Dataset.from_dict(dataset, features=Features({
    "bbox": Array2D(shape=(3, 4), dtype="int64"),
    "input_ids": Value("int64")
}))
dataset.set_format("torch")
print(dataset[0]['bbox'])
# tensor([[1, 2, 3, 4],
#         [1, 2, 3, 4],
#         [1, 2, 3, 4]])

If you don't specify the Array2D feature type, the inferred type will be Sequence(Sequence(Value("int64"))), and the torch formatter will therefore return a list of tensors.
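Since the Array2D feature in the example declares one shape for the whole column, it can help to confirm that every item really has that shape before building the dataset. Here is a minimal sketch in plain Python; `common_shape` is a hypothetical helper written for illustration, not part of datasets:

```python
def common_shape(rows):
    """Return the (n_rows, n_cols) shape shared by every 2-D item.

    Raises ValueError if any item is ragged or if items differ in shape.
    Hypothetical helper for illustration; not part of the datasets library.
    """
    shapes = set()
    for item in rows:
        # Collect the distinct inner lengths of this item.
        inner_lengths = {len(inner) for inner in item}
        if len(inner_lengths) != 1:
            raise ValueError("item is empty or has rows of different lengths")
        shapes.add((len(item), inner_lengths.pop()))
    if len(shapes) != 1:
        raise ValueError(f"items do not share one shape: {sorted(shapes)}")
    return shapes.pop()


# Same data layout as the 'bbox' column above: 4 items of shape (3, 4).
bboxes = [[[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] for _ in range(4)]
shape = common_shape(bboxes)
```

The resulting tuple could then be passed as the `shape` argument of Array2D. The helper also accepts NumPy arrays, since it only relies on `len` and iteration.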

vermouthmjl · 2021-03-25T12:46:53Z

Thanks for the explanation.
With my original DataFrame, I did

dataset = dataset.to_dict("list")

and then the rest of the transformation from dictionary works just fine.

vermouthmjl closed this as completed Mar 25, 2021
