
Choosing data format[s] for creating generic inference datasets #197

Open
lukeyeager opened this issue Aug 6, 2015 · 15 comments

@lukeyeager
Member

[Branching discussion in #189 into separate thread]

In order to create the dataset, is it possible to have the user specify [train|val|test].txt files in the form of:
/path/to/file [y1,...,yn]
It would be nice if DIGITS could create the image and label databases from these files (in theory that would allow the user to use non-image files too).
#189 (comment)

I'd like to enable at least three methods for people to create "Generic Inference" datasets.

  1. By uploading prebuilt LMDBs (see discussion here about how many LMDBs to allow)
  2. With textfiles and a list of floats as @gheinrich mentioned above
  3. By parsing a folder
    • We could expect a .npy file containing an n-dimensional label for each image, like so:
 images/
    ├── image1.png
    ├── image1.npy
    ├── image2.jpg
    └── image2.npy
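A minimal sketch of option (3), assuming the pairing convention shown above (the function name, extension list, and error handling here are illustrative, not DIGITS code):

```python
import os

# Hypothetical sketch: pair each image in a folder with a same-stem
# .npy label file, per the layout above. Extension list is an assumption.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

def parse_folder(folder):
    """Return a sorted list of (image_path, label_path) pairs."""
    pairs = []
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTS:
            continue
        label = os.path.join(folder, stem + ".npy")
        if not os.path.isfile(label):
            raise ValueError("missing .npy label for %s" % name)
        pairs.append((os.path.join(folder, name), label))
    return pairs
```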
@lukeyeager
Member Author

More thoughts from @Pastafarianist at #97 (comment):

Each line in such a text file is currently matched against (.+)\s+(\d+)\s*$ (path/to/image 123). This could be replaced with (.+)((?:\s+\d+(?:\.\d*)?)+)\s*$ to check for a list of ints or floats (path/to/image 123 4. 5.67).

That's pretty much the same as option (2) above.
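A quick sketch of how such a pattern could parse these lines (the first group is made non-greedy here so that all numeric tokens land in the label group; parse_line is a hypothetical helper, not DIGITS code):

```python
import re

# Current single-integer-label pattern vs. a list-of-numbers pattern.
# Group 1 is non-greedy so every trailing numeric token falls into group 2.
CURRENT = re.compile(r"(.+)\s+(\d+)\s*$")
PROPOSED = re.compile(r"(.+?)((?:\s+\d+(?:\.\d*)?)+)\s*$")

def parse_line(line):
    """Return (path, [labels as floats]), or None if the line doesn't match."""
    m = PROPOSED.match(line)
    if m is None:
        return None
    return m.group(1), [float(tok) for tok in m.group(2).split()]
```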

@crohkohl

Why not load caffe datum as defined in the protocol buffers and give it the file extension *.cdatum?

@lukeyeager
Member Author

@crohkohl, thanks for the input! Are you suggesting replacing option (3), or creating another option?

I'd like to move away from Caffe-specific solutions where possible. You do bring up a good point, though: Datums can only be 3-dimensional, whereas numpy.ndarrays can be N-dimensional. So I guess we'd have to throw an error if you tried to supply a 4-D label. That, or start supporting HDF5 in DIGITS.

@crohkohl

Are you suggesting replacing option (3), or creating another option?

Well, I personally would choose a format that is independent of Python for data import. I think a lot of people have toolchains other than Python (like me). One could provide a custom protobuf format, for example, that could be exported from all supported languages.

Further, I would not provide this only for labels. It would also be great to load the actual data that way, and not be limited to standard 2-D image formats like PNG. That way one could easily import 1-D to 4-D data and labels into the database for Caffe using the same mechanism.

And isn't the Caffe blob structure N-D as well? It has an N-D shape, if I recall correctly.

@lukeyeager
Member Author

Ok, that's two good suggestions:

  1. Non-python format
    • Good point, especially if we want to support Torch.
    • A custom protobuf doesn't seem like the right choice. Any other ideas?
  2. Non-image data
    • I had been thinking of this as a separate issue, with a different interface for creating video or audio datasets. Maybe they should all be merged into the same interface?
    • Could we ask the user which type of data they want and look for appropriate file extensions?

And isn't the Caffe blob structure N-D as well? It has an N-D shape, if I recall correctly.

BlobProto is N-D, but Datum is 3-D, and Caffe's Data layer expects Datums in LMDB.

@crohkohl

A custom protobuf doesn't seem like the right choice. Any other ideas?

I think the only option that is simple and portable is a raw binary format enriched with metadata provided by the user, e.g. shape, datatype, compression.
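One way such a format could look (an illustration only; the magic bytes and header layout here are invented for this sketch, not an established format):

```python
import struct

MAGIC = b"RAWT"  # invented 4-byte magic number for this sketch

def write_tensor(f, dtype_code, shape, payload):
    """Write a header (dtype code, ndim, shape as little-endian uint32s),
    then the raw tensor bytes."""
    f.write(MAGIC)
    f.write(struct.pack("<BB", dtype_code, len(shape)))
    f.write(struct.pack("<%dI" % len(shape), *shape))
    f.write(payload)

def read_tensor(f):
    """Read one tensor back; payload interpretation is left to the caller."""
    if f.read(4) != MAGIC:
        raise ValueError("bad magic")
    dtype_code, ndim = struct.unpack("<BB", f.read(2))
    shape = struct.unpack("<%dI" % ndim, f.read(4 * ndim))
    return dtype_code, shape, f.read()
```

Because the header is just fixed-width integers, equivalent readers and writers would be easy to implement in C++, Lua/Torch, or any other toolchain.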

As a side note:

Whatever you choose to implement will probably not be sufficient for everyone, so consider providing a plugin design for data parsing.

The import-plugin interface class would register certain file extensions and handle loading by returning a 3-D datum and an N-D label. That way it could easily be extended by the community without having to touch a lot of code.

Further, in the GUI the user could choose which importers should be used. Upon selection, the importer plugin could show a custom parameter dialog.

@jms90h5

jms90h5 commented Aug 15, 2015

I'm one of those users interested in "audio". I put audio in quotes because that general type of time-series data encompasses many similar use cases: for example, EKG heart waveforms and EEG brain waveforms (the latter happens to be one of my use cases). Even sensor data from heavy equipment is essentially a one-dimensional signal and can be processed in the same way as audio.

I agree with crohkohl that some kind of plugin, with documentation and perhaps an example implementation or two, is probably the right way to go. Even for actual audio there are a ton of formats and configuration variables. Furthermore, in keeping with the generic-inference theme, it would make sense for this enhancement to be usable in a very general way. For example, in addition to the actual waveform samples, classification can be performed on feature vectors generated from them; the feature vectors are basically just another representation of the same data (and still a sequence of numeric values). Currently I'm pretending the values in the feature vectors are pixels in an image so I can use DIGITS with little or no modification, but it would be nice not to have to do that.

In another discussion there is an explanation of how to generate LMDBs separately from DIGITS that Caffe can operate on. While that is an OK workaround, it defeats part of the convenience and low barrier to entry that DIGITS provides. While it's quite possible that part of the plugin's operation would be to call that same DB-creation code, it would be nice to at least have a wrapper around it that integrates it into the DIGITS UI: for example, some way to tell DIGITS to use a specific plugin rather than one of the existing input mechanisms.

@lukeyeager
Member Author

lukeyeager commented Sep 2, 2015

How about a CSV (or TSV) like this:

x,          y
(2),        (1)
ndarray,    value
[2,3],      0.6667
[5,2],      2.5
...

Row by row, that's (1) the name and (2) the shape of each tensor, then (3) the type and (4+) the values that provide the tensor's data. Here's how it would look for the classic image classification problem:

data,               label
(256,256,3),        (1)
file,               value
/images/cat/1.jpg,  0
/images/dog/1.png,  1
...

For segmentations:

image,              segmentations
(200,100),          (200,100,20)
file,               file
/data/1_img.png,    /data/1_seg.npy
/data/2_img.png,    /data/2_seg.npy
...

So, we'd have two types of plugins:

  1. Data type plugins (value, ndarray, file, etc.)
  2. File type plugins for different file extensions with the file data type (image, numpy, csv, etc.)
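A sketch of how a reader for this layout might look (assuming cells that themselves contain commas, such as shapes or inline arrays, are quoted per standard CSV rules; read_generic_csv is a hypothetical name):

```python
import ast
import csv
import io

def read_generic_csv(text):
    """Parse rows 1-3 as (name, shape, type) per column; return the
    column descriptors plus the remaining data rows."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    columns = []
    for name, shape_s, type_s in zip(rows[0], rows[1], rows[2]):
        shape = ast.literal_eval(shape_s)
        if isinstance(shape, int):  # "(1)" evaluates to a bare int
            shape = (shape,)
        columns.append({"name": name, "shape": shape, "type": type_s})
    return columns, rows[3:]
```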

@FridayAccessory123

@lukeyeager, for what it's worth I'd like to second @gheinrich's request above - it would be so simple if we could just specify a vector of labels in the text files! Are we likely to see this feature any time soon? Thanks.

@lukeyeager
Member Author

@FridayAccessory123 ok, thanks for the feedback! Does the solution in my previous post (#197 (comment)) seem too complex? If you wanted bounding boxes instead of classifications, you could format the CSV like this:

image,               boxes
(256,256,3),        (2,2)
file,               ndarray
/images/scene1.jpg,  [[0.1,0.1],[0.3,0.3]]
/images/scene2.png,  [[0.2,0.4],[0.4,0.6]]
...

@FridayAccessory123

@lukeyeager, not too complex at all - a simple vector of ints would be enough for me right now but your solution certainly looks like the way forward more generally. Thanks again.

@lukeyeager
Member Author

/cc @Deepomatic

@thiagoribeirodamotta

Is there any news on this?

@sodeypunk

We really need this feature as well. Would appreciate an update!

@lukeyeager
Member Author

Our main focus right now is around running DIGITS on a cluster of machines (related to #108). As a precursor, I'm working on moving all the stored data over to a SQL database instead of pickle files (#566). I'm intentionally keeping the database schema as data-agnostic as possible, and also as framework-agnostic as possible. So hopefully our work towards that major new feature will make generic datasets easier to implement in the future.
