Add support for HDF5 datasets #226
Conversation
Now that this is integrated into DIGITS, I can run some proper speed and size tests (see original tests at #224).
Sizes are in MB and times are in seconds.
Luke, do I understand correctly that those are the times to create a database? Do you think it would be interesting to also measure how long it takes to load a batch from the database (when the batch is cold in cache and when it is hot in cache)? Just to verify that loading the samples from the database is comfortably faster than the typical processing that follows.
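(For reference, a rough sketch of how such a batch-read timing could be done with h5py; the file name `train.h5`, dataset name `data`, and batch size are illustrative assumptions, not something from this PR.)

```python
import time
import h5py

def time_batch_read(path, dset_name='data', batch_size=256):
    """Time two reads of the same batch: the first is (probably) cold in the
    OS page cache, the second hot."""
    with h5py.File(path, 'r') as f:
        dset = f[dset_name]
        t0 = time.time()
        _ = dset[:batch_size]          # first read: likely cold
        cold = time.time() - t0
        t0 = time.time()
        _ = dset[:batch_size]          # same blocks again: hot
        hot = time.time() - t0
    return cold, hot

print('cold=%.4fs hot=%.4fs' % time_batch_read('train.h5'))
```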
Yes, I'll post that information when I can. I have to knock out that last TODO before I can integrate HDF5 datasets into networks, and it's taking more work than I had hoped initially.
Update: Caffe doesn't actually support integer data in HDF5 files (see BVLC/caffe#2978). Rather than make a change to Caffe and require DIGITS users to update to the newest version, I'll simply force float data for now and take the hit on filesize.
Update: Caffe doesn't support the LZF compression filter - only raw and GZIP are supported.
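(A minimal sketch of what those two constraints imply when writing the file with h5py: the data is cast to float32 and compressed with GZIP. The array shape and names below are illustrative, not the actual DIGITS code.)

```python
import h5py
import numpy as np

# illustrative 8-bit images: N x C x H x W
images = np.zeros((100, 3, 32, 32), dtype=np.uint8)

with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data',
                     data=images.astype(np.float32),  # Caffe's HDF5 reader wants float data
                     compression='gzip')              # LZF isn't readable by Caffe; raw or GZIP only
```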
@@ -24,6 +26,7 @@
import numpy as np
import PIL.Image
import lmdb
import h5py
does this need to be added to requirements.txt?
Yep, good point.
(added in #247)
force-pushed from f67e414 to 187c78f
force-pushed from e1dcb00 to 3712ae2
Experiment 1
Experiment 2
Results
New problem: I have to cap each HDF5 dataset at no more than INT_MAX elements.
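(Roughly, the cap works out as below; this is just an illustration of the arithmetic, not the exact DIGITS code.)

```python
INT_MAX = 2**31 - 1

def max_images_per_dataset(channels, height, width):
    # Caffe rejects an HDF5 dataset whose total element count exceeds INT_MAX,
    # so the number of images per dataset is bounded by INT_MAX // elements-per-image.
    return INT_MAX // (channels * height * width)

print(max_images_per_dataset(3, 256, 256))  # -> 10922 images for 3x256x256 inputs
```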
I suspect the reason HDF5Data is so much faster than Data is [ironically] because it loads the whole dataset at once (BVLC/caffe#2892) instead of prefetching.
force-pushed from 48ea27d to e631db3
It would be nice if HDF5 support could be exposed as an optional functionality of the underlying framework. Perhaps something similar to the API to tell whether the framework supports shuffling training data?
You're right - we should definitely do that. Can we merge this (unless you have any other feedback) and do it in a later pull request? I know it's a bit of a step backwards in terms of framework-independence, but I'd like to get this off my plate.
On e631db3 I had an error when trying to create an HDF5 database (MNIST with compression=GZIP)
I am fine with the idea of merging HDF5 support first even if this introduces a slight digression from the "unified framework" model. Maybe a general comment is that it is not scalable to multiple DB backends (with some logic embedded in the HTML and caffe_train.py)... not to say that I could do better!
logger.info('Reached HDF5 dataset size limit')
db.close()
db = _create_hdf5_db(output_dir, images_written,
        hdf5_dset_limit, compression, image_channels,
don't you need to increase hdf5_dset_limit? How is this different from what you're doing on line 261? I am probably missing something...
I'm closing the first .h5 file and opening a new one. Caffe can open multiple .h5 files, but none of them can have a dataset "bigger" than INT_MAX.
That's weird. What version of hdf5 do you have?
I have a slightly newer version of h5py and a slightly older version of HDF5.
I am not seeing the problem on 10f3cd2. Is e631db3 good to test?
Ubuntu 14.04 was released 04/2014 and HDF5 1.8.4 was released 09/2009. It seems weird that you would have such an old version. How did you install HDF5 - do you remember?
I have Ubuntu 12.04. Apparently the default HDF5 package (installed via
I did:
So apparently a newer version of the hdf5 package should be installed manually? Sounds like a minor hassle, right?
Ah, that explains it. We don't technically support 12.04, BTW.
That is a hassle. Let's not make this any harder than it needs to be. Let me put together a workaround ...
I don't think there was any doubt about it but just in case: it is working on Ubuntu 14.04.
Show correct filesize approximation for HDF5
Caffe imposes a limit on the shape of an HDF5 dataset: the product of the dimensions must be <= INT_MAX (2^31 - 1). To get around this, you have to create multiple HDF5 files and create a text file which contains a list of them.
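(A sketch of that workaround with h5py: split the data into several .h5 files, each with fewer than INT_MAX elements per dataset, and write the list file that Caffe's HDF5Data layer takes as its source. The `data`/`label` dataset names and file layout here are illustrative assumptions, not the DIGITS implementation.)

```python
import os
import h5py
import numpy as np

INT_MAX = 2**31 - 1

def write_hdf5_shards(images, labels, output_dir, compression='gzip'):
    per_image = int(np.prod(images.shape[1:]))
    per_file = max(1, INT_MAX // per_image)   # keep each dataset's element count under INT_MAX
    paths = []
    for i, start in enumerate(range(0, len(images), per_file)):
        path = os.path.join(output_dir, 'train_%d.h5' % i)
        with h5py.File(path, 'w') as f:
            f.create_dataset('data',
                             data=images[start:start + per_file].astype(np.float32),
                             compression=compression)
            f.create_dataset('label',
                             data=labels[start:start + per_file].astype(np.float32),
                             compression=compression)
        paths.append(path)
    # the HDF5Data layer's "source" parameter points at this text file
    with open(os.path.join(output_dir, 'list.txt'), 'w') as f:
        f.write('\n'.join(paths) + '\n')
    return paths
```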
Ubuntu 12.04 comes with HDF5 1.8.4, whereas 14.04 comes with 1.8.11.

> ## Release 1.8.7 of May 2011 versus Release 1.8.6
> HDF5 now allows the size of any dataspace dimension to be 0 (zero). This was previously allowed only if the maximum size of the dimension was unlimited.

https://www.hdfgroup.org/HDF5/doc/ADGuide/Changes_1_8_x.html
@gheinrich can you verify that the last patch makes this work for 12.04?
Yes, now this is working on Ubuntu 12.04 with HDF5 1.8.4. Thanks!
Add support for HDF5 datasets
Note that you need NVIDIA/caffe#26 (included in ...).
Now that this PR is merged, is there somebody working on further support of HDF5 for non-image data?
@aralph not actively right now. But we are re-evaluating our data format as we evaluate adding new DL frameworks. What's your use-case - why do you want better HDF5 support?
@lukeyeager We work with multi-channel data in Caffe. We have faced some limitations with LMDB. Developing, manipulating, and storing/loading data turned out to be more flexible with HDF5.
Being able to import a prebuilt HDF5 dataset would be very helpful. One use case would be vector labels, like in SVHN format 1.
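(For example, a prebuilt HDF5 file with vector labels might look like the sketch below; the shapes and names are made up, and `data`/`label` simply match the usual HDF5Data top blob names.)

```python
import h5py
import numpy as np

n = 1000
with h5py.File('svhn_like.h5', 'w') as f:
    f.create_dataset('data', data=np.zeros((n, 3, 64, 64), dtype=np.float32))  # images, N x C x H x W
    f.create_dataset('label', data=np.zeros((n, 5), dtype=np.float32))         # one 5-element label vector per image
```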
Closes #224

TODO before merge

- HDF5Data layers

TODO after merge