Add support for HDF5 datasets #226

Merged: 5 commits merged into NVIDIA:master on Sep 15, 2015
Conversation

@lukeyeager (Member)

Closes #224

TODO before merge

TODO after merge

@lukeyeager (Member Author)

Now that this is integrated into DIGITS, I can run some proper speed and size tests (see original tests at #224).

  • MNIST train imageset
    • 60,000 images
    • 28x28 grayscale
  • ImageNet-like imageset
    • 3,676 images
    • 256x256 color
| Backend | Image Encoding | Database Compression | MNIST size | MNIST time | ImageNet size | ImageNet time |
|---------|----------------|----------------------|------------|------------|---------------|---------------|
| LMDB    |                |                      | 59         | 60         | 704           | 20            |
| LMDB    | PNG            |                      | 22         | 56         | 323           | 73            |
| LMDB    | JPEG (lossy)   |                      | 57         | 57         | 66            | 21            |
| HDF5    |                |                      | 46         | 59         | 697           | 18            |
| HDF5    |                | LZF                  | 14         | 59         | 618           | 24            |
| HDF5    |                | GZIP                 | 11         | 60         | 551           | 45            |

Sizes are in MB and times are in seconds.

@gheinrich (Contributor)

Luke, do I understand correctly that those are the times to create a database? Do you think it would be interesting to also measure how long it takes to load a batch from the database (when the batch is cold in cache and when it is hot in cache)? Just to verify that loading the samples from the database is comfortably faster than the typical processing that follows.
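(For illustration only, a minimal sketch of such a measurement, not something from this PR: time one batch read twice, once when the file is not yet cached and once when it is hot. `train.h5` and the `data` dataset name are assumptions.)

```python
import time
import h5py

def time_batch_read(path, batch_size=256):
    start = time.time()
    with h5py.File(path, 'r') as f:
        _ = f['data'][:batch_size]  # read one batch worth of rows
    return time.time() - start

# The first call is closer to "cold" (colder still after dropping the OS page
# cache, e.g. via /proc/sys/vm/drop_caches); the second call is "hot".
print('first read: ', time_batch_read('train.h5'))
print('second read:', time_batch_read('train.h5'))
```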

@lukeyeager (Member Author)

Yes, I'll post that information when I can. I have to knock out that last TODO before I can integrate HDF5 datasets into networks, and it's taking more work than I had hoped initially.

@lukeyeager (Member Author)

Update: Caffe doesn't actually support integer data in HDF5 files (see BVLC/caffe#2978).

Rather than make a change to Caffe and require DIGITS users to update to the newest version, I'll simply force float data for now and take the hit on filesize.
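(For illustration, a minimal h5py sketch of what "forcing float data" means here, not the DIGITS code; the shapes and dataset names are made up.)

```python
import h5py
import numpy as np

# Made-up example data: 100 grayscale 28x28 images and their labels
images = np.zeros((100, 1, 28, 28), dtype=np.uint8)
labels = np.zeros((100,), dtype=np.uint8)

with h5py.File('train.h5', 'w') as f:
    # Caffe's HDF5 reader only accepts float/double datasets, so cast the
    # uint8 pixels to float32 even though that quadruples the size on disk.
    f.create_dataset('data', data=images.astype('float32'))
    f.create_dataset('label', data=labels.astype('float32'))
```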

@lukeyeager (Member Author)

Update: Caffe doesn't support the LZF compression filter - only raw and GZIP are supported.

HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 140259111430720:
#000: ../../../src/H5Dio.c line 182 in H5Dread(): can't read data
major: Dataset
minor: Read failed
#001: ../../../src/H5Dio.c line 550 in H5D__read(): can't read data
major: Dataset
minor: Read failed
#002: ../../../src/H5Dchunk.c line 1837 in H5D__chunk_read(): unable to read raw data chunk
major: Low-level I/O
minor: Read failed
#003: ../../../src/H5Dchunk.c line 2868 in H5D__chunk_lock(): data pipeline read failed
major: Data filters
minor: Filter operation failed
#004: ../../../src/H5Z.c line 1150 in H5Z_pipeline(): required filter 'lzf' is not registered
major: Data filters
minor: Read failed
#005: ../../../src/H5PL.c line 293 in H5PL_load(): search in paths failed
major: Plugin for dynamically loaded library
minor: Can't get value
#006: ../../../src/H5PL.c line 397 in H5PL__find(): can't open directory
major: Plugin for dynamically loaded library
minor: Can't open directory or file
F0826 16:08:40.263033   752 io.cpp:268] Check failed: status >= 0 (-1 vs. 0) Failed to read float dataset data
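(In other words, a hedged sketch rather than the actual DIGITS code: when writing with h5py, only no compression or gzip survives a round trip through Caffe, because Caffe's HDF5 build does not register h5py's LZF filter.)

```python
import h5py
import numpy as np

data = np.zeros((100, 3, 256, 256), dtype='float32')

with h5py.File('train.h5', 'w') as f:
    # 'gzip' is a standard HDF5 filter, so Caffe can read it back
    f.create_dataset('data', data=data, compression='gzip')
    # compression='lzf' also works within h5py, but Caffe's reader then
    # fails with "required filter 'lzf' is not registered" (see log above)
```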

@@ -24,6 +26,7 @@
import numpy as np
import PIL.Image
import lmdb
import h5py
Contributor

does this need to be added to requirements.txt?

Member Author

Yep, good point.

Member Author

(added in #247)

@lukeyeager (Member Author)

Experiment 1

  • MNIST dataset
    • 60,000 images in 10 classes
    • 28x28 grayscale
  • LeNet network
    • 3 epochs
| Backend | Image encoding | Database compression | Dataset filesize (MB) | Dataset creation time (sec) | Model training time (sec) |
|---------|----------------|----------------------|-----------------------|-----------------------------|---------------------------|
| LMDB    |                |                      | 59                    | 60                          | 54                        |
| LMDB    | PNG            |                      | 22                    | 56                          | 53                        |
| LMDB    | JPEG (lossy)   |                      | 57                    | 57                          | 54                        |
| HDF5    |                |                      | 181                   | 62                          | 19                        |
| HDF5    |                | GZIP                 | 17                    | 62                          | 18                        |

Experiment 2

  • ImageNet-like dataset
    • 3,676 images in 2 classes
    • 256x256 color
  • AlexNet network
    • 3 epochs
| Backend | Image encoding | Database compression | Dataset filesize (MB) | Dataset creation time (sec) | Model training time (sec) |
|---------|----------------|----------------------|-----------------------|-----------------------------|---------------------------|
| LMDB    |                |                      | 704                   | 20                          | 60                        |
| LMDB    | PNG            |                      | 323                   | 73                          | 60                        |
| LMDB    | JPEG (lossy)   |                      | 66                    | 21                          | 60                        |
| HDF5    |                |                      | 2800                  | 23                          | 31                        |
| HDF5    |                | GZIP                 | 772                   | 88                          | 33                        |

Results

  1. HDF5 has a lot of filesize bloat (see "Allow H5T_INTEGER in HDF5 files", BVLC/caffe#2978)
  2. HDF5Data seems to be about 2x faster than Data for data reads (!)
    • I'm going to run some more experiments to figure out what's going on here

@lukeyeager (Member Author)

New problem: I have to cap the HDF5 files at no more than INT_MAX numbers per dataset.

See BVLC/caffe#2953 (comment)
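(Rough arithmetic for what that cap means, with illustrative numbers assuming 256x256 color images stored as float32:)

```python
# One 256x256 color image stored as floats is 3*256*256 = 196,608 values,
# so a single HDF5 dataset may hold at most ~10,922 such images before the
# product of its dimensions exceeds INT_MAX.
INT_MAX = 2**31 - 1
values_per_image = 3 * 256 * 256
print(INT_MAX // values_per_image)  # -> 10922
```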

@lukeyeager (Member Author)

I suspect the reason HDF5Data is so much faster than Data is [ironically] because it loads the whole dataset at once (BVLC/caffe#2892) instead of prefetching.

@lukeyeager (Member Author)

I think I've knocked out all the gotchas. And I put in a warning box explaining what is not currently supported:
[screenshot: hdf5-warnings]

@lukeyeager mentioned this pull request Sep 10, 2015
@gheinrich (Contributor)

It would be nice if HDF5 support could be exposed as an optional functionality of the underlying framework. Perhaps something similar to the API to tell whether the framework supports shuffling training data?
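(Something along these lines, perhaps: a hypothetical sketch of the suggestion, with made-up class and method names rather than the actual DIGITS API.)

```python
class Framework(object):
    # Hypothetical capability flag, analogous to the existing query for
    # whether a framework supports shuffling training data.
    SUPPORTS_HDF5_DATASETS = False

    def supports_hdf5_datasets(self):
        return self.SUPPORTS_HDF5_DATASETS


class CaffeFramework(Framework):
    SUPPORTS_HDF5_DATASETS = True  # Caffe ships an HDF5Data layer
```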

@lukeyeager (Member Author)

You're right - we should definitely do that. Can we merge this (unless you have any other feedback) and do it in a later pull request? I know it's a bit of a step backwards in terms of framework-independence, but I'd like to get this off my plate.

@gheinrich (Contributor)

On e631db3 I had an error when trying to create an HDF5 database (MNIST with compression=GZIP)

2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: sid = h5s.create_simple(shape, maxshape)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-build-JuupVN/h5py/h5py/_objects.c:2467)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-build-JuupVN/h5py/h5py/_objects.c:2424)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: File "h5py/h5s.pyx", line 99, in h5py.h5s.create_simple (/tmp/pip-build-JuupVN/h5py/h5py/h5s.c:1416)
2015-09-11 15:51:07 [20150911-155103-f98f] [WARNING] Create DB (train) unrecognized output: ValueError: Zero sized dimension for non-unlimited dimension (Zero sized dimension for non-unlimited dimension)

I am fine with the idea of merging HDF5 support first, even if it introduces a slight digression from the "unified framework" model. Maybe a general comment is that the current approach doesn't scale well to multiple DB backends (with some logic embedded in the HTML and in caffe_train.py)... not that I could do better!

logger.info('Reached HDF5 dataset size limit')
db.close()
db = _create_hdf5_db(output_dir, images_written,
                     hdf5_dset_limit, compression, image_channels,
Contributor

don't you need to increase hdf5_dset_limit? How is this different from what you're doing on line 261? I am probably missing something...

Member Author

I'm closing the first .h5 file and opening a new one. Caffe can open multiple .h5 files, but none of them can have a dataset "bigger" than INT_MAX.
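(A rough sketch of that scheme, illustrative only and not the actual DIGITS code: each .h5 file stays under the INT_MAX limit, and a plain text file lists them all for Caffe's HDF5Data layer.)

```python
import os
import h5py

INT_MAX = 2**31 - 1

def write_split_hdf5(images, labels, output_dir):
    """Write (N, C, H, W) images and (N,) labels across several .h5 files."""
    rows_per_file = INT_MAX // images[0].size  # keep each dataset under INT_MAX values
    with open(os.path.join(output_dir, 'list.txt'), 'w') as list_file:
        for part, start in enumerate(range(0, len(images), rows_per_file)):
            path = os.path.join(output_dir, 'train_%d.h5' % part)
            with h5py.File(path, 'w') as f:
                f.create_dataset('data', data=images[start:start + rows_per_file].astype('float32'))
                f.create_dataset('label', data=labels[start:start + rows_per_file].astype('float32'))
            list_file.write(path + '\n')  # Caffe's HDF5Data layer takes this list as its source
```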

@lukeyeager (Member Author)

On e631db3 I had an error when trying to create an HDF5 database (MNIST with compression=GZIP)

That's weird. What version of hdf5 do you have?

$ python -c "import h5py; print h5py.__doc__"

    This is the h5py package, a Python interface to the HDF5
    scientific data format.

    Version 2.2.1

    HDF5 1.8.11

@gheinrich (Contributor)

I have a slightly newer version of h5py and a slightly older version of HDF5.

(venv)gheinrich@android-devel-wks-7:/fast-scratch/gheinrich/ws/digits$ python -c "import h5py; print h5py.__doc__"

    This is the h5py package, a Python interface to the HDF5
    scientific data format.

    Version 2.5.0

    HDF5 1.8.4

I am not seeing the problem on 10f3cd2. Is e631db3 good to test?

@lukeyeager (Member Author)

> Release 1.8.7 of May 2011 versus Release 1.8.6
>
> HDF5 now allows the size of any dataspace dimension to be 0 (zero). This was previously allowed only if the maximum size of the dimension was unlimited.
>
> https://www.hdfgroup.org/HDF5/doc/ADGuide/Changes_1_8_x.html

@lukeyeager (Member Author)

Ubuntu 14.04 was released 04/2014 and HDF5 1.8.4 was released 09/2009. It seems weird that you would have such an old version. How did you install HDF5 - do you remember?

@gheinrich (Contributor)

I have Ubuntu 12.04. Apparently the default HDF5 package (installed via apt-get libhdf5-serial-dev) is 1.8.4.

$ apt-cache show libhdf5-serial-dev
Package: libhdf5-serial-dev
Priority: optional
Section: universe/libdevel
Installed-Size: 16347
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Debian GIS Project <[email protected]>
Architecture: amd64
Source: hdf5
Version: 1.8.4-patch1-3ubuntu2

I did:

apt-get update
apt-get install --only-upgrade libhdf5-serial-dev
...
libhdf5-serial-dev is already the newest version.

So apparently a newer version of the hdf5 package would have to be installed manually? Sounds like a minor hassle, right?

@lukeyeager (Member Author)

I have Ubuntu 12.04

Ah, that explains it. We don't technically support 12.04, BTW.

So apparently a newer version of hdf5 package should be installed manually? Sounds like a minor hassle, right?

That is a hassle. Let's not make this any harder than it needs to be. Let me put together a workaround ...
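(For reference, one possible workaround, a guess at the approach rather than necessarily what the patch does: HDF5 releases before 1.8.7 reject a zero-sized dimension unless that dimension's maximum size is unlimited, so either declare the growing dimension unlimited or create the dataset only once the first rows are known.)

```python
import h5py

with h5py.File('train.h5', 'w') as f:
    # Works on HDF5 1.8.4: the first dimension may start at 0 because its
    # maximum size is unlimited (None); the dataset is resized as rows arrive.
    dset = f.create_dataset('data', shape=(0, 3, 256, 256),
                            maxshape=(None, 3, 256, 256), dtype='float32')
```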

@gheinrich (Contributor)

I don't think there was any doubt about it but just in case: it is working on Ubuntu 14.04.
By the way, do you think it would be useful to add a little badge in front of each dataset on the main page to show which database backend was used (similar to the model badge showing the DL framework)?

Show correct filesize approximation for HDF5

Caffe imposes a limit on the shape of an HDF5 dataset. The product of the dimensions must be <= INT_MAX (2^31 - 1). To get around this, you have to create multiple HDF5 files and create a textfile which contains a list of them.

Ubuntu 12.04 comes with HDF5 1.8.4, whereas 14.04 comes with 1.8.11:

> ## Release 1.8.7 of May 2011 versus Release 1.8.6
> HDF5 now allows the size of any dataspace dimension to be 0 (zero). This was previously allowed only if the maximum size of the dimension was unlimited.
> https://www.hdfgroup.org/HDF5/doc/ADGuide/Changes_1_8_x.html
@lukeyeager (Member Author)

@gheinrich can you verify that the last patch makes this work for 12.04?

@gheinrich (Contributor)

Yes, now this is working on Ubuntu 12.04 with HDF5 1.8.4. Thanks!

lukeyeager added a commit that referenced this pull request Sep 15, 2015
Add support for HDF5 datasets
@lukeyeager merged commit 6e8abe4 into NVIDIA:master Sep 15, 2015
@lukeyeager deleted the hdf5 branch September 15, 2015 22:18
@lukeyeager (Member Author)

Note that you need NVIDIA/caffe#26 (Included in v0.13.2) to avoid "CNMEM_NOT_INITIALIZED" errors.

@aralph commented Oct 7, 2016

Now that this PR is merged, is there somebody working on further support of HDF5 for non-image data?

@lukeyeager (Member Author)

@aralph not actively right now. But we are re-evaluating our data format as we evaluate adding new DL frameworks. What's your use-case - why do you want better HDF5 support?

@aralph commented Oct 11, 2016

@lukeyeager We work with multi-channel data in Caffe. We have faced some limitations with LMDB; development, manipulation, and storing/loading of data turned out to be more flexible with HDF5.
HDF5 also lets you store metadata in the database, which does not seem to be possible with LMDB.
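(As an aside, a small illustration of what "metadata in the database" can look like with h5py, with made-up keys and values: arbitrary attributes can be attached to the file or to individual datasets, which LMDB's flat key/value store has no direct equivalent for.)

```python
import h5py

with h5py.File('dataset.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(10, 4, 128, 128), dtype='float32')
    # Attributes travel with the file / dataset itself
    f.attrs['channel_names'] = ['R', 'G', 'B', 'NIR']
    dset.attrs['normalization'] = 'per-channel zero mean'
```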

@spotofleopard

Being able to import a prebuilt HDF5 dataset would be very helpful. One use case would be vector labels, like in SVHN format 1.
