Add support for HDF5 datasets #226

Merged: 5 commits merged into NVIDIA:master on Sep 15, 2015
Conversation

@lukeyeager (Member)

Closes #224

TODO before merge

TODO after merge

@lukeyeager (Member Author)

Now that this is integrated into DIGITS, I can run some proper speed and size tests (see original tests at #224).

  • MNIST train imageset
    • 60,000 images
    • 28x28 grayscale
  • ImageNet-like imageset
    • 3,676 images
    • 256x256 color
| Backend | Image Encoding | Database Compression | MNIST size | MNIST time | ImageNet size | ImageNet time |
|---------|----------------|----------------------|------------|------------|---------------|---------------|
| LMDB    |                |                      | 59         | 60         | 704           | 20            |
| LMDB    | PNG            |                      | 22         | 56         | 323           | 73            |
| LMDB    | JPEG (lossy)   |                      | 57         | 57         | 66            | 21            |
| HDF5    |                |                      | 46         | 59         | 697           | 18            |
| HDF5    |                | LZF                  | 14         | 59         | 618           | 24            |
| HDF5    |                | GZIP                 | 11         | 60         | 551           | 45            |

Sizes are in MB and times are in seconds.

@gheinrich (Contributor)

Luke, do I understand correctly that those are the times to create a database? Do you think it would be interesting to also measure how long it takes to load a batch from the database (when the batch is cold in cache and when it is hot in cache)? Just to verify that loading the samples from the database is comfortably faster than the typical processing that follows.
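(For illustration only, a minimal sketch of such a measurement, not something from this PR: time one batch read twice, once when the file is not yet cached and once when it is hot. `train.h5` and the `data` dataset name are assumptions.)

```python
import time
import h5py

def time_batch_read(path, batch_size=256):
    start = time.time()
    with h5py.File(path, 'r') as f:
        _ = f['data'][:batch_size]  # read one batch worth of rows
    return time.time() - start

# The first call is closer to "cold" (colder still after dropping the OS page
# cache, e.g. via /proc/sys/vm/drop_caches); the second call is "hot".
print('first read: ', time_batch_read('train.h5'))
print('second read:', time_batch_read('train.h5'))
```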

@lukeyeager (Member Author)

Yes, I'll post that information when I can. I have to knock out that last TODO before I can integrate HDF5 datasets into networks, and it's taking more work than I had hoped initially.

@lukeyeager (Member Author)

Update: Caffe doesn't actually support integer data in HDF5 files (see BVLC/caffe#2978).

Rather than make a change to Caffe and require DIGITS users to update to the newest version, I'll simply force float data for now and take the hit on filesize.
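(For illustration, a minimal h5py sketch of what "forcing float data" means here, not the DIGITS code; the shapes and dataset names are made up.)

```python
import h5py
import numpy as np

# Made-up example data: 100 grayscale 28x28 images and their labels
images = np.zeros((100, 1, 28, 28), dtype=np.uint8)
labels = np.zeros((100,), dtype=np.uint8)

with h5py.File('train.h5', 'w') as f:
    # Caffe's HDF5 reader only accepts float/double datasets, so cast the
    # uint8 pixels to float32 even though that quadruples the size on disk.
    f.create_dataset('data', data=images.astype('float32'))
    f.create_dataset('label', data=labels.astype('float32'))
```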

@lukeyeager (Member Author)

Update: Caffe doesn't support the LZF compression filter - only raw and GZIP are supported.

HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 140259111430720:
#000: ../../../src/H5Dio.c line 182 in H5Dread(): can't read data
major: Dataset
minor: Read failed
#001: ../../../src/H5Dio.c line 550 in H5D__read(): can't read data
major: Dataset
minor: Read failed
#002: ../../../src/H5Dchunk.c line 1837 in H5D__chunk_read(): unable to read raw data chunk
major: Low-level I/O
minor: Read failed
#003: ../../../src/H5Dchunk.c line 2868 in H5D__chunk_lock(): data pipeline read failed
major: Data filters
minor: Filter operation failed
#004: ../../../src/H5Z.c line 1150 in H5Z_pipeline(): required filter 'lzf' is not registered
major: Data filters
minor: Read failed
#005: ../../../src/H5PL.c line 293 in H5PL_load(): search in paths failed
major: Plugin for dynamically loaded library
minor: Can't get value
#006: ../../../src/H5PL.c line 397 in H5PL__find(): can't open directory
major: Plugin for dynamically loaded library
minor: Can't open directory or file
F0826 16:08:40.263033   752 io.cpp:268] Check failed: status >= 0 (-1 vs. 0) Failed to read float dataset data
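(In other words, a hedged sketch rather than the actual DIGITS code: when writing with h5py, only no compression or gzip survives a round trip through Caffe, because Caffe's HDF5 build does not register h5py's LZF filter.)

```python
import h5py
import numpy as np

data = np.zeros((100, 3, 256, 256), dtype='float32')

with h5py.File('train.h5', 'w') as f:
    # 'gzip' is a standard HDF5 filter, so Caffe can read it back
    f.create_dataset('data', data=data, compression='gzip')
    # compression='lzf' also works within h5py, but Caffe's reader then
    # fails with "required filter 'lzf' is not registered" (see log above)
```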

@@ -24,6 +26,7 @@
import numpy as np
import PIL.Image
import lmdb
import h5py
Contributor

does this need to be added to requirements.txt?

Member Author

Yep, good point.

Member Author

(added in #247)

@lukeyeager (Member Author)

Experiment 1

  • MNIST dataset
    • 60,000 images in 10 classes
    • 28x28 grayscale
  • LeNet network
    • 3 epochs
| Backend | Image encoding | Database compression | Dataset filesize (MB) | Dataset creation time (sec) | Model training time (sec) |
|---------|----------------|----------------------|-----------------------|-----------------------------|---------------------------|
| LMDB    |                |                      | 59                    | 60                          | 54                        |
| LMDB    | PNG            |                      | 22                    | 56                          | 53                        |
| LMDB    | JPEG (lossy)   |                      | 57                    | 57                          | 54                        |
| HDF5    |                |                      | 181                   | 62                          | 19                        |
| HDF5    |                | GZIP                 | 17                    | 62                          | 18                        |

Experiment 2

  • ImageNet-like dataset
    • 3,676 images in 2 classes
    • 256x256 color
  • AlexNet network
    • 3 epochs
| Backend | Image encoding | Database compression | Dataset filesize (MB) | Dataset creation time (sec) | Model training time (sec) |
|---------|----------------|----------------------|-----------------------|-----------------------------|---------------------------|
| LMDB    |                |                      | 704                   | 20                          | 60                        |
| LMDB    | PNG            |                      | 323                   | 73                          | 60                        |
| LMDB    | JPEG (lossy)   |                      | 66                    | 21                          | 60                        |
| HDF5    |                |                      | 2800                  | 23                          | 31                        |
| HDF5    |                | GZIP                 | 772                   | 88                          | 33                        |

Results

  1. HDF5 has a lot of filesize bloat (see "Allow H5T_INTEGER in HDF5 files", BVLC/caffe#2978)
  2. HDF5Data seems to be about 2x faster than Data for data reads (!)
    • I'm going to run some more experiments to figure out what's going on here

@lukeyeager (Member Author)

New problem: I have to cap the HDF5 files at no more than INT_MAX numbers per dataset.

See BVLC/caffe#2953 (comment)
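(Rough arithmetic for what that cap means, with illustrative numbers assuming 256x256 color images stored as float32:)

```python
# One 256x256 color image stored as floats is 3*256*256 = 196,608 values,
# so a single HDF5 dataset may hold at most ~10,922 such images before the
# product of its dimensions exceeds INT_MAX.
INT_MAX = 2**31 - 1
values_per_image = 3 * 256 * 256
print(INT_MAX // values_per_image)  # -> 10922
```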

@lukeyeager (Member Author)

I suspect the reason HDF5Data is so much faster than Data is [ironically] because it loads the whole dataset at once (BVLC/caffe#2892) instead of prefetching.

@lukeyeager (Member Author)

I think I've knocked out all the gotchas. And I put in a warning box explaining what is not currently supported:
[screenshot: hdf5-warnings]

@lukeyeager mentioned this pull request Sep 10, 2015
@gheinrich (Contributor)

It would be nice if HDF5 support could be exposed as an optional functionality of the underlying framework. Perhaps something similar to the API to tell whether the framework supports shuffling training data?
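(Something along these lines, perhaps: a hypothetical sketch of the suggestion, with made-up class and method names rather than the actual DIGITS API.)

```python
class Framework(object):
    # Hypothetical capability flag, analogous to the existing query for
    # whether a framework supports shuffling training data.
    SUPPORTS_HDF5_DATASETS = False

    def supports_hdf5_datasets(self):
        return self.SUPPORTS_HDF5_DATASETS


class CaffeFramework(Framework):
    SUPPORTS_HDF5_DATASETS = True  # Caffe ships an HDF5Data layer
```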

@lukeyeager (Member Author)

You're right - we should definitely do that. Can we merge this (unless you have any other feedback) and do it in a later pull request? I know it's a bit of a step backwards in terms of framework-independence, but I'd like to get this off my plate.

@gheinrich (Contributor)

On e631db3 I had an error when trying to create an HDF5 database (MNIST with compression=GZIP)

2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: sid = h5s.create_simple(shape, maxshape)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-build-JuupVN/h5py/h5py/_objects.c:2467)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-build-JuupVN/h5py/h5py/_objects.c:2424)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: File "h5py/h5s.pyx", line 99, in h5py.h5s.create_simple (/tmp/pip-build-JuupVN/h5py/h5py/h5s.c:1416)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: ValueError: Zero sized dimension for non-unlimited dimension (Zero sized dimension for non-unlimited dimension)

I am fine with the idea of merging HDF5 support first, even if it introduces a slight digression from the "unified framework" model. Maybe a general comment is that the current approach doesn't scale well to multiple DB backends (with some logic embedded in the HTML and in caffe_train.py)... not that I could do better!

logger.info('Reached HDF5 dataset size limit')
db.close()
db = _create_hdf5_db(output_dir, images_written,
                     hdf5_dset_limit, compression, image_channels,
Contributor

don't you need to increase hdf5_dset_limit? How is this different from what you're doing on line 261? I am probably missing something...

Member Author

I'm closing the first .h5 file and opening a new one. Caffe can open multiple .h5 files, but none of them can have a dataset "bigger" than INT_MAX.
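(A rough sketch of that scheme, illustrative only and not the actual DIGITS code: each .h5 file stays under the INT_MAX limit, and a plain text file lists them all for Caffe's HDF5Data layer.)

```python
import os
import h5py

INT_MAX = 2**31 - 1

def write_split_hdf5(images, labels, output_dir):
    """Write (N, C, H, W) images and (N,) labels across several .h5 files."""
    rows_per_file = INT_MAX // images[0].size  # keep each dataset under INT_MAX values
    with open(os.path.join(output_dir, 'list.txt'), 'w') as list_file:
        for part, start in enumerate(range(0, len(images), rows_per_file)):
            path = os.path.join(output_dir, 'train_%d.h5' % part)
            with h5py.File(path, 'w') as f:
                f.create_dataset('data', data=images[start:start + rows_per_file].astype('float32'))
                f.create_dataset('label', data=labels[start:start + rows_per_file].astype('float32'))
            list_file.write(path + '\n')  # Caffe's HDF5Data layer takes this list as its source
```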

@lukeyeager (Member Author)

On e631db3 I had an error when trying to create an HDF5 database (MNIST with compression=GZIP)

That's weird. What version of hdf5 do you have?

$ python -c "import h5py; print h5py.__doc__"

    This is the h5py package, a Python interface to the HDF5
    scientific data format.

    Version 2.2.1

    HDF5 1.8.11

@gheinrich (Contributor)

I have a slightly newer version of h5py and a slightly older version of HDF5.

(venv)gheinrich@android-devel-wks-7:/fast-scratch/gheinrich/ws/digits$ python -c "import h5py; print h5py.__doc__"

    This is the h5py package, a Python interface to the HDF5
    scientific data format.

    Version 2.5.0

    HDF5 1.8.4

I am not seeing the problem on 10f3cd2. Is e631db3 good to test?

@lukeyeager (Member Author)

> Release 1.8.7 of May 2011 versus Release 1.8.6
>
> HDF5 now allows the size of any dataspace dimension to be 0 (zero). This was previously allowed only if the maximum size of the dimension was unlimited.
>
> https://www.hdfgroup.org/HDF5/doc/ADGuide/Changes_1_8_x.html

@lukeyeager (Member Author)

Ubuntu 14.04 was released 04/2014 and HDF5 1.8.4 was released 09/2009. It seems weird that you would have such an old version. How did you install HDF5 - do you remember?

@gheinrich (Contributor)

I have Ubuntu 12.04. Apparently the default HDF5 package (installed via apt-get libhdf5-serial-dev) is 1.8.4.

$ apt-cache show libhdf5-serial-dev
Package: libhdf5-serial-dev
Priority: optional
Section: universe/libdevel
Installed-Size: 16347
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Debian GIS Project <[email protected]>
Architecture: amd64
Source: hdf5
Version: 1.8.4-patch1-3ubuntu2

I did:

apt-get update
apt-get install --only-upgrade libhdf5-serial-dev
...
libhdf5-serial-dev is already the newest version.

So apparently a newer version of the hdf5 package would have to be installed manually? Sounds like a minor hassle, right?

@lukeyeager (Member Author)

I have Ubuntu 12.04

Ah, that explains it. We don't technically support 12.04, BTW.

So apparently a newer version of hdf5 package should be installed manually? Sounds like a minor hassle, right?

That is a hassle. Let's not make this any harder than it needs to be. Let me put together a workaround ...
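(For reference, one possible workaround, a guess at the approach rather than necessarily what the patch does: HDF5 releases before 1.8.7 reject a zero-sized dimension unless that dimension's maximum size is unlimited, so either declare the growing dimension unlimited or create the dataset only once the first rows are known.)

```python
import h5py

with h5py.File('train.h5', 'w') as f:
    # Works on HDF5 1.8.4: the first dimension may start at 0 because its
    # maximum size is unlimited (None); the dataset is resized as rows arrive.
    dset = f.create_dataset('data', shape=(0, 3, 256, 256),
                            maxshape=(None, 3, 256, 256), dtype='float32')
```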

@gheinrich (Contributor)

I don't think there was any doubt about it but just in case: it is working on Ubuntu 14.04.
By the way, do you think it would be useful to add a little badge in front of each dataset on the main page to show which database backend was used (similar to the model badge showing the DL framework)?

Show correct filesize approximation for HDF5

Caffe imposes a limit on the shape of an HDF5 dataset. The product of the dimensions must be <= INT_MAX (2^31 - 1). To get around this, you have to create multiple HDF5 files and create a textfile which contains a list of them.

Ubuntu 12.04 comes with HDF5 1.8.4, whereas 14.04 comes with 1.8.11:

> ## Release 1.8.7 of May 2011 versus Release 1.8.6
> HDF5 now allows the size of any dataspace dimension to be 0 (zero). This was previously allowed only if the maximum size of the dimension was unlimited.
> https://www.hdfgroup.org/HDF5/doc/ADGuide/Changes_1_8_x.html
@lukeyeager (Member Author)

@gheinrich can you verify that the last patch makes this work for 12.04?

@gheinrich (Contributor)

Yes, now this is working on Ubuntu 12.04 with HDF5 1.8.4. Thanks!

lukeyeager added a commit that referenced this pull request Sep 15, 2015
Add support for HDF5 datasets
@lukeyeager merged commit 6e8abe4 into NVIDIA:master Sep 15, 2015
@lukeyeager deleted the hdf5 branch September 15, 2015 22:18
@lukeyeager (Member Author)

Note that you need NVIDIA/caffe#26 (Included in v0.13.2) to avoid "CNMEM_NOT_INITIALIZED" errors.

@aralph commented Oct 7, 2016

Now that this PR is merged, is there somebody working on further support of HDF5 for non-image data?

@lukeyeager (Member Author)

@aralph not actively right now. But we are re-evaluating our data format as we evaluate adding new DL frameworks. What's your use-case - why do you want better HDF5 support?

@aralph commented Oct 11, 2016

@lukeyeager We work with multi-channel data in Caffe. We have faced some limitations with LMDB; development, manipulation, and storing/loading of data turned out to be more flexible with HDF5.
HDF5 also lets you store metadata in the database, which does not seem to be possible with LMDB.
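(As an aside, a small illustration of what "metadata in the database" can look like with h5py, with made-up keys and values: arbitrary attributes can be attached to the file or to individual datasets, which LMDB's flat key/value store has no direct equivalent for.)

```python
import h5py

with h5py.File('dataset.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(10, 4, 128, 128), dtype='float32')
    # Attributes travel with the file / dataset itself
    f.attrs['channel_names'] = ['R', 'G', 'B', 'NIR']
    dset.attrs['normalization'] = 'per-channel zero mean'
```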

@spotofleopard

Being able to import a prebuilt HDF5 dataset would be very helpful. One use case would be vector labels, like in SVHN format 1.
