Multidimensional arrays in a Dataset #2080

Closed
vermouthmjl opened this issue Mar 18, 2021 · 2 comments

vermouthmjl commented Mar 18, 2021

Hi,

I'm trying to put together a datasets.Dataset to use with LayoutLM, which is available in transformers. This model requires the bounding box of each token in a sequence as input. That is when I realized that Dataset does not support multi-dimensional arrays as the value of a column in a row.

The following code results in a conversion error in pyarrow (pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column bbox with type object')):

from datasets import Dataset
import pandas as pd
import numpy as np

dataset = pd.DataFrame({
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
})
dataset = Dataset.from_pandas(dataset)

Since I want to use PyTorch for the downstream training task, I also tried a few ways to put a column of 2-D PyTorch tensors directly into a formatted dataset, but I can only get a list of 1-D tensors, a list of arrays, or a list of lists.

import torch
from datasets import Dataset
import pandas as pd

dataset = pd.DataFrame({
    'bbox': [
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]],
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]],
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]],
        [[1,2,3,4],[1,2,3,4],[1,2,3,4]]
    ],
    'input_ids': [1, 2, 3, 4]
})
dataset = Dataset.from_pandas(dataset)

def test(examples):
    return {'bbbox': torch.Tensor(examples['bbox'])}
dataset = dataset.map(test)
print(dataset[0]['bbox'])
print(dataset[0]['bbbox'])

dataset.set_format(type='torch', columns=['input_ids', 'bbox'], output_all_columns=True)
print(dataset[0]['bbox'])
print(dataset[0]['bbbox'])

def test2(examples):
    return {'bbbox': torch.stack(examples['bbox'])}
dataset = dataset.map(test2)

print(dataset[0]['bbox'])
print(dataset[0]['bbbox'])

Is it possible to support n-D arrays/tensors in datasets?
It seems that it can also be useful for this feature request.

lhoestq (Member) commented Mar 19, 2021

Hi!

This is actually supported, but not yet in from_pandas.
For now, you can use from_dict instead:

from datasets import Dataset
import numpy as np

dataset = {
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
}
dataset = Dataset.from_dict(dataset)

This will work, but to use it with the torch formatter you must specify the Array2D feature type so that the formatter knows the shape of each array:

from datasets import Dataset, Array2D, Features, Value
import numpy as np

dataset = {
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
}
dataset = Dataset.from_dict(dataset, features=Features({
    "bbox": Array2D(shape=(3, 4), dtype="int64"),
    "input_ids": Value("int64")
}))
dataset.set_format("torch")
print(dataset[0]['bbox'])
# tensor([[1, 2, 3, 4],
#         [1, 2, 3, 4],
#         [1, 2, 3, 4]])

If you don't specify the Array2D feature type, the inferred type will be Sequence(Sequence(Value("int64"))), and the torch formatter will therefore return a list of tensors.

vermouthmjl (Author) commented

Thanks for the explanation.
With my original DataFrame, I did

dataset = dataset.to_dict("list")

After that, the rest of the conversion from the dictionary works just fine.
