
Choosing data format[s] for creating generic inference datasets #197

Open
lukeyeager opened this issue Aug 6, 2015 · 15 comments

@lukeyeager
Member

[Branching discussion in #189 into separate thread]

In order to create the dataset, is it possible to have the user specify [train|val|test].txt files in the form of:
/path/to/file [y1,...,yn]
It would be nice if DIGITS could create the image and label databases from these files (in theory that would allow the user to use non-image files too).
#189 (comment)

I'd like to enable at least three methods for people to create "Generic Inference" datasets.

  1. By uploading prebuilt LMDBs (see discussion here about how many LMDBs to allow)
  2. With textfiles and a list of floats as @gheinrich mentioned above
  3. By parsing a folder
    • We could expect a .npy file containing an n-dimensional label for each image, like so:
 images/
    ├── image1.png
    ├── image1.npy
    ├── image2.jpg
    └── image2.npy
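A minimal sketch of option (3), assuming the pairing convention shown above (the function name, extension list, and error handling here are illustrative, not DIGITS code):

```python
import os

# Hypothetical sketch: pair each image in a folder with a same-stem
# .npy label file, per the layout above. Extension list is an assumption.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

def parse_folder(folder):
    """Return a sorted list of (image_path, label_path) pairs."""
    pairs = []
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTS:
            continue
        label = os.path.join(folder, stem + ".npy")
        if not os.path.isfile(label):
            raise ValueError("missing .npy label for %s" % name)
        pairs.append((os.path.join(folder, name), label))
    return pairs
```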
@lukeyeager
Member Author

More thoughts from @Pastafarianist at #97 (comment):

Each line in such a text file is currently matched against (.+)\s+(\d+)\s*$ (path/to/image 123). This could be replaced with (.+)((?:\s+\d+(?:\.\d*)?)+)\s*$ to check for a list of ints or floats (path/to/image 123 4. 5.67).

That's pretty much the same as option (2) above.
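A quick sketch of how such a pattern could parse these lines (the first group is made non-greedy here so that all numeric tokens land in the label group; parse_line is a hypothetical helper, not DIGITS code):

```python
import re

# Current single-integer-label pattern vs. a list-of-numbers pattern.
# Group 1 is non-greedy so every trailing numeric token falls into group 2.
CURRENT = re.compile(r"(.+)\s+(\d+)\s*$")
PROPOSED = re.compile(r"(.+?)((?:\s+\d+(?:\.\d*)?)+)\s*$")

def parse_line(line):
    """Return (path, [labels as floats]), or None if the line doesn't match."""
    m = PROPOSED.match(line)
    if m is None:
        return None
    return m.group(1), [float(tok) for tok in m.group(2).split()]
```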

@crohkohl

Why not load caffe datum as defined in the protocol buffers and give it the file extension *.cdatum?

@lukeyeager
Member Author

@crohkohl, thanks for the input! Are you suggesting replacing option (3), or creating another option?

I'd like to move away from Caffe-specific solutions where possible. You do bring up a good point, though: Datums can only be 3-dimensional, whereas numpy.ndarrays can be N-dimensional. So I guess we'd have to throw an error if you tried to supply a 4-D label. That, or start supporting HDF5 in DIGITS.

@crohkohl

Are you suggesting replacing option (3), or creating another option?

Well, I personally would choose a format that is independent of Python for data import. I think a lot of people have toolchains other than Python (like me). One could provide a custom protobuf format, for example, that could be exported from all supported languages.

Further, I would not provide this only for labels. It would also be great to load the actual data that way, and not be limited to standard 2-D image formats like PNG. That way one could easily import 1-D to 4-D data and labels into the database for Caffe using the same mechanism.

And isn't the Caffe blob structure N-D as well? It has an N-D shape, if I recall correctly.

@lukeyeager
Member Author

Ok, that's two good suggestions:

  1. Non-python format
    • Good point, especially if we want to support Torch.
    • A custom protobuf doesn't seem like the right choice. Any other ideas?
  2. Non-image data
    • I had been thinking of this as a separate issue, with a different interface for creating video or audio datasets. Maybe they should all be merged into the same interface?
    • Could we ask the user which type of data they want and look for appropriate file extensions?

And isn't the Caffe blob structure N-D as well? It has an N-D shape, if I recall correctly.

BlobProto is N-D, but Datum is 3-D, and Caffe's Data layer expects Datums in LMDB.

@crohkohl

A custom protobuf doesn't seem like the right choice. Any other ideas?

I think the only option that is simple and portable is a raw binary format enriched with metadata provided by the user, e.g. shape, datatype, compression.
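One way such a format could look (an illustration only; the magic bytes and header layout here are invented for this sketch, not an established format):

```python
import struct

MAGIC = b"RAWT"  # invented 4-byte magic number for this sketch

def write_tensor(f, dtype_code, shape, payload):
    """Write a header (dtype code, ndim, shape as little-endian uint32s),
    then the raw tensor bytes."""
    f.write(MAGIC)
    f.write(struct.pack("<BB", dtype_code, len(shape)))
    f.write(struct.pack("<%dI" % len(shape), *shape))
    f.write(payload)

def read_tensor(f):
    """Read one tensor back; payload interpretation is left to the caller."""
    if f.read(4) != MAGIC:
        raise ValueError("bad magic")
    dtype_code, ndim = struct.unpack("<BB", f.read(2))
    shape = struct.unpack("<%dI" % ndim, f.read(4 * ndim))
    return dtype_code, shape, f.read()
```

Because the header is just fixed-width integers, equivalent readers and writers would be easy to implement in C++, Lua/Torch, or any other toolchain.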

As a side note:

Whatever you choose to implement will probably not be sufficient for everyone, so consider providing a plugin design for data parsing.

The import-plugin interface class would register certain file extensions and handle loading by returning a 3-D datum and an N-D label. That way it could easily be extended by the community without having to touch a lot of code.

Further, in the GUI the user could choose which importers should be used. Upon selection, the importer plugin could show a custom parameter dialog.

@jms90h5

jms90h5 commented Aug 15, 2015

I'm one of those users interested in "audio". I put audio in quotes because that general type of time-series data encompasses many similar use cases: for example, EKG heart waveforms and EEG brain waveforms (the latter happens to be one of my use cases). Even sensor data from heavy equipment is essentially a one-dimensional signal and can be processed in the same way as audio.

I agree with crohkohl that some kind of plugin, with documentation and perhaps an example implementation or two, is probably the right way to go. Even for actual audio there are a ton of formats and configuration variables. Furthermore, in keeping with the generic-inference theme, it would make sense for this enhancement to be usable in a very general way. For example, in addition to the actual waveform samples, classification can be performed on feature vectors generated from them; the feature vectors are basically just another representation of the same data (and still a sequence of numeric values). Currently I'm pretending the values in the feature vectors are pixels in an image so I can use DIGITS with little or no modification, but it would be nice not to have to do that.

In another discussion there is an explanation of how to generate LMDBs separately from DIGITS that Caffe can operate on. While that is an OK workaround, it defeats part of the convenience and low barrier to entry that DIGITS provides. While it's quite possible that part of the plugin's operation would be to call that same DB-creation code, it would be nice to at least have a wrapper around it that integrates it into the DIGITS UI: for example, some way to tell DIGITS to use a specific plugin rather than one of the existing input mechanisms.

@lukeyeager
Member Author

lukeyeager commented Sep 2, 2015

How about a CSV (or TSV) like this:

x,          y
(2),        (1)
ndarray,    value
[2,3],      0.6667
[5,2],      2.5
...

Row by row, that's (1) the name and (2) the shape of each tensor, then (3) the type and (4+) the values that provide the tensor's data. Here's how it would look for the classic image classification problem:

data,               label
(256,256,3),        (1)
file,               value
/images/cat/1.jpg,  0
/images/dog/1.png,  1
...

For segmentations:

image,              segmentations
(200,100),          (200,100,20)
file,               file
/data/1_img.png,    /data/1_seg.npy
/data/2_img.png,    /data/2_seg.npy
...

So, we'd have two types of plugins:

  1. Data type plugins (value, ndarray, file, etc.)
  2. File type plugins for different file extensions with the file data type (image, numpy, csv, etc.)
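A sketch of how a reader for this layout might look (assuming cells that themselves contain commas, such as shapes or inline arrays, are quoted per standard CSV rules; read_generic_csv is a hypothetical name):

```python
import ast
import csv
import io

def read_generic_csv(text):
    """Parse rows 1-3 as (name, shape, type) per column; return the
    column descriptors plus the remaining data rows."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    columns = []
    for name, shape_s, type_s in zip(rows[0], rows[1], rows[2]):
        shape = ast.literal_eval(shape_s)
        if isinstance(shape, int):  # "(1)" evaluates to a bare int
            shape = (shape,)
        columns.append({"name": name, "shape": shape, "type": type_s})
    return columns, rows[3:]
```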

@FridayAccessory123

@lukeyeager, for what it's worth I'd like to second @gheinrich's request above - it would be so simple if we could just specify a vector of labels in the text files! Are we likely to see this feature any time soon? Thanks.

@lukeyeager
Member Author

@FridayAccessory123 ok, thanks for the feedback! Does the solution in my previous post (#197 (comment)) seem too complex? If you wanted bounding boxes instead of classifications, you could format the CSV like this:

image,               boxes
(256,256,3),        (2,2)
file,               ndarray
/images/scene1.jpg,  [[0.1,0.1],[0.3,0.3]]
/images/scene2.png,  [[0.2,0.4],[0.4,0.6]]
...

@FridayAccessory123

@lukeyeager, not too complex at all - a simple vector of ints would be enough for me right now but your solution certainly looks like the way forward more generally. Thanks again.

@lukeyeager
Member Author

/cc @Deepomatic

@thiagoribeirodamotta

Is there any news on this?

@sodeypunk

We really need this feature as well. Would appreciate an update!

@lukeyeager
Member Author

Our main focus right now is around running DIGITS on a cluster of machines (related to #108). As a precursor, I'm working on moving all the stored data over to a SQL database instead of pickle files (#566). I'm intentionally keeping the database schema as data-agnostic as possible, and also as framework-agnostic as possible. So hopefully our work towards that major new feature will make generic datasets easier to implement in the future.
