Multidimensional arrays in a Dataset #2080
Comments
Hi! This is actually supported, but not yet in

from datasets import Dataset, Array2D, Features, Value
import pandas as pd
import numpy as np

dataset = {
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
}
dataset = Dataset.from_dict(dataset)

This will work, but to use it with the torch formatter you must specify the features explicitly:

from datasets import Dataset, Array2D, Features, Value
import pandas as pd
import numpy as np

dataset = {
    'bbox': [
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]),
        np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
    ],
    'input_ids': [1, 2, 3, 4]
}
dataset = Dataset.from_dict(dataset, features=Features({
    "bbox": Array2D(shape=(3, 4), dtype="int64"),
    "input_ids": Value("int64")
}))
dataset.set_format("torch")
print(dataset[0]['bbox'])
# tensor([[1, 2, 3, 4],
#         [1, 2, 3, 4],
#         [1, 2, 3, 4]])

If you don't specify the
Thanks for the explanation.
and then the rest of the transformation from dictionary works just fine.
Hi,
I'm trying to put together a datasets.Dataset to be used with LayoutLM, which is available in transformers. This model requires as input the bounding boxes of each token in a sequence. This is when I realized that Dataset does not support multi-dimensional arrays as a value for a column in a row.
The following code results in a conversion error in pyarrow (pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column bbox with type object')).
Since I wanted to use PyTorch for the downstream training task, I also tried a few ways to directly put a column of 2-D PyTorch tensors in a formatted dataset, but I can only have a list of 1-D tensors, or a list of arrays, or a list of lists.
Is it possible to support n-D arrays/tensors in datasets?
It seems that it can also be useful for this feature request.
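For reference, the "list of lists" fallback mentioned above can be sketched with numpy alone: each 2-D array is stored as nested Python lists and rebuilt into an array when read back. The helper names `to_nested_lists` and `from_nested_lists` are hypothetical and not part of `datasets`; this only illustrates the workaround, not the library's own handling of n-D data.

```python
import numpy as np

def to_nested_lists(arrays):
    """Convert each n-D array into plain nested Python lists."""
    return [np.asarray(a).tolist() for a in arrays]

def from_nested_lists(rows, dtype="int64"):
    """Rebuild one numpy array per row from nested lists."""
    return [np.asarray(r, dtype=dtype) for r in rows]

# Two rows with a different number of boxes, four coordinates per box.
boxes = [
    np.array([[1, 2, 3, 4], [5, 6, 7, 8]]),
    np.array([[0, 0, 10, 10]]),
]

column = to_nested_lists(boxes)       # storable as a list-of-lists column
restored = from_nested_lists(column)  # back to 2-D arrays after reading
print([r.shape for r in restored])
```

The trade-off is that the 2-D shape is no longer part of the column's declared type, so it has to be restored by hand on every read.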