I'm doing deep learning with Caffe and generating my own dataset in HDF5 format. I have 131 976 images, all 224x224, which come to about 480MB on disk, and each image has a 1x6 array as a label. When I generate the .h5 files, they come to 5GB each, 125GB in total. I just want to make sure this is expected. I've checked the contents, but I don't understand how the storage requirement is about 250 times bigger. All I'm doing is filling numpy arrays X and Y and creating the datasets (25 in total).

import h5py

# X (images) and Y (labels) are numpy arrays filled earlier; j is the file index
with h5py.File('/media/joe/SAMSUNG/GraspingData/HDF5/train' + str(j) + '.h5', 'w') as H:
    H.create_dataset('graspData', data=X)    # note the name - give to layer
    H.create_dataset('graspLabel', data=Y)
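
As a sanity check on the expected file size, the uncompressed footprint of an array can be worked out from its shape and dtype before anything is written. The sketch below is illustrative only: the helper name `uncompressed_bytes` is hypothetical, and the 3x224x224 per-image shape and the three candidate dtypes are assumptions, since the question does not state the dtype of `X`.

```python
import numpy as np

def uncompressed_bytes(shape, dtype):
    """Bytes needed to store an array of this shape and dtype without compression."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

n_images = 131976               # image count from the question
image_shape = (3, 224, 224)     # assumed: RGB images, channels first

# Candidate dtypes are assumptions; the real dtype of X is not given.
for dtype in ("uint8", "float32", "float64"):
    total = uncompressed_bytes((n_images,) + image_shape, dtype)
    print(f"{dtype:>8}: {total / 1e9:.1f} GB in total")
```

Comparing these totals with the size on disk gives a rough indication of which element type was actually written.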
Mad Physicist
Joe Watson
  • What is the shape and type of your data (`X` and `Y`)? And is this all the code that you have that generates the .h5 file? – Hannes Ovrén Nov 16 '15 at 16:30
  • Might just be a problem with your h5py library. I had similar problems using a given h5py library (probably fedora default one), and solved it by switching to anaconda. – Tony Nov 16 '15 at 16:52
  • @HannesOvrén X is (53125,3,224,224) per batch (224x224 RGB images). The code before my snippet is just filling an array of the above size. – Joe Watson Nov 16 '15 at 17:22
  • What's the type? Because if X is 8 bytes per element then that's almost 60GB. If Y is the same size then you get close to your 125 GB. If it's normal images the type should be `uint8`. – Hannes Ovrén Nov 16 '15 at 17:51
  • @JoeWatson The images take only 480MB because they are saved in a compressed format (e.g., jpg/png), while data in HDF5 is uncompressed. The amount of space you need to store a single image in the float32 data type (the type accepted by caffe) is 3*224*224*4=~600KB (!) – Shai Nov 17 '15 at 06:36
  • To follow up on what @Shai says, while compression is not enabled by default, all you have to do to enable it is to pass the keyword `compression="gzip"` to `H.create_dataset()`. And as @Hannes points out, you may also want `dtype="uint8"`. – Yossarian Nov 17 '15 at 09:45
  • @Yossarian I'm not sure `uint8` is desirable, as caffe expects its input to be in `float32` format. – Shai Nov 17 '15 at 10:38
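
The suggestions in the comments above can be combined into a minimal sketch: keep the data as `float32` and pass `compression="gzip"` to `create_dataset`. The small zero-filled arrays and the temporary file path are stand-ins chosen for illustration, not the asker's data. Whether a given Caffe build reads compressed datasets is not confirmed here.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in arrays; the real X and Y from the question are far larger.
X = np.zeros((8, 3, 224, 224), dtype=np.float32)   # float32 image batch
Y = np.zeros((8, 6), dtype=np.float32)             # one 1x6 label per image

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "train_example.h5")

with h5py.File(path, "w") as H:
    # compression="gzip" stores the dataset in chunks with lossless compression
    H.create_dataset("graspData", data=X, compression="gzip")
    H.create_dataset("graspLabel", data=Y, compression="gzip")

# Read back the stored dtype and compression filter to confirm what was written.
with h5py.File(path, "r") as H:
    stored_dtype = H["graspData"].dtype
    stored_compression = H["graspData"].compression

print(stored_dtype, stored_compression, os.path.getsize(path), "bytes on disk")
```

Zero-filled arrays compress far better than real photographs, so the saving seen here overstates what to expect on actual image data.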
