I have a large dataset, which makes my lmdb huge: for 16,000 samples the database is already 20 GB, and in total I have 800,000 images, which would add up to an enormous amount of data. Is there any way to compress an lmdb? Or would it be better to use HDF5 files? I would like to know what the best solution to this problem is.
-
Did you convert the images using Caffe's `convert_imageset`? If yes, did you use the `--encoded` parameter? – nomem Apr 04 '17 at 19:40
-
No, I am using my own Python code to do it, since I have to alter and reshape my data. @Inman – Apr 05 '17 at 10:06
-
How would you encode the files programmatically? What I do is: `vtxn.put('{:0>10d}'.format(in_idx), datum.SerializeToString())`. But I don't think it is possible to "compress" the output of `SerializeToString()`? @Inman – Apr 05 '17 at 11:32
-
I don't think you need to compress the output of `SerializeToString()`. Rather, you need to set the datum data to jpg/png bytes and set the encoded flag. For details see `io.cpp`. – nomem Apr 06 '17 at 20:06
-
@Inman IMHO you should write your last comment as an answer so I can give you credit for your nice help! I think that is the answer I was looking for! – Apr 10 '17 at 12:14
-
ok. added answer. – nomem Apr 10 '17 at 13:27
2 Answers
If you look inside the `ReadImageToDatum` function in `io.cpp`, you can see that a datum can hold the image either in compressed (jpg/png) form or as raw pixels. To use the compressed form, encode the loaded image with `cv::imencode`, set the datum data to the compressed bytes, set the `encoded` flag, and then store the datum in the lmdb as usual.
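For concreteness, here is a minimal Python sketch of that approach, in the spirit of the asker's own conversion snippet. The `samples` iterable, the database name, and the key format are assumptions; the parts the answer actually prescribes are the `cv2.imencode` call, the `encoded` flag, and storing `SerializeToString()`:

```python
import lmdb
import cv2
from caffe.proto import caffe_pb2

# 'samples' is a placeholder: an iterable of (HxWxC uint8 image, int label) pairs.
env = lmdb.open('train_lmdb', map_size=1 << 40)  # generous map size; lmdb grows sparsely

with env.begin(write=True) as txn:
    for in_idx, (img, label) in enumerate(samples):
        # Compress to PNG (lossless) or '.jpg' (lossy, smaller) instead of raw pixels.
        ok, buf = cv2.imencode('.png', img)
        if not ok:
            raise RuntimeError('imencode failed on sample %d' % in_idx)

        datum = caffe_pb2.Datum()
        datum.channels = img.shape[2]
        datum.height = img.shape[0]
        datum.width = img.shape[1]
        datum.data = buf.tobytes()  # compressed bytes, not raw pixel data
        datum.label = int(label)
        datum.encoded = True        # tells Caffe to decode the image when reading

        txn.put('{:0>10d}'.format(in_idx).encode('ascii'), datum.SerializeToString())
```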

-
`datum->set_encoded(true);` as used in https://github.com/BVLC/caffe/blob/master/src/caffe/util/io.cpp#L133 – nomem Apr 12 '17 at 18:41
-
Another good option is to use HDF5 files, since one can compress them. What do you think is the better solution? Since I have images and ground-truth labels anyway, I think it might be better to store my data as HDF5 files. @Inman – Apr 12 '17 at 21:00
-
@thigi It entirely depends on your configuration; there is no general benchmark of lmdb vs. hdf5. If data prefetching works OK for you then you can use whichever format is convenient. – nomem Apr 13 '17 at 12:47
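To make the HDF5 alternative from these comments concrete, a minimal h5py sketch might look like the following. The arrays and file name are placeholders; the dataset names `data` and `label` follow the usual top blob names of a Caffe `HDF5Data` layer, and `compression='gzip'` is the tunable compression parameter mentioned above:

```python
import h5py
import numpy as np

# Placeholder arrays: N images of 3x256x256 float32 plus one float label each.
images = np.random.rand(16, 3, 256, 256).astype(np.float32)
labels = np.random.randint(0, 10, size=16).astype(np.float32)

with h5py.File('train.h5', 'w') as f:
    # compression_opts (0-9) trades dataset creation time against file size.
    f.create_dataset('data', data=images, compression='gzip', compression_opts=4)
    f.create_dataset('label', data=labels, compression='gzip', compression_opts=4)
```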
There are various techniques to reduce input size, but much depends on your application. For instance, the ILSVRC-2012 images can be resized to about 256x256 pixels without nasty effects on training time or model accuracy; this reduces that data set from 240 GB to 40 GB. Can your data set tolerate the loss of fidelity from such simple "physical" compression? How small do you need the data set to be?
I'm afraid that I haven't worked with HDF5 files enough to have an informed opinion.
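As an illustration of the resizing suggestion above, a minimal sketch (the folder names are placeholders; `INTER_AREA` is a reasonable interpolation choice when shrinking images):

```python
import cv2
import glob
import os

# Downsize every image in a placeholder source folder to a standard 256x256.
os.makedirs('resized_images', exist_ok=True)
for path in glob.glob('raw_images/*.png'):
    img = cv2.imread(path)
    small = cv2.resize(img, (256, 256), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join('resized_images', os.path.basename(path)), small)
```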

-
I want it to be as small as possible and as fast as possible to create, so I probably have to find the best trade-off between the two. At the moment I am trying to create HDF5 files, where you can set a compression parameter; obviously, the stronger the compression, the longer it takes to create the dataset. I have my own dataset in the form of PNGs. I need to store them in an lmdb, an HDF5 file, or anything else that **caffe** accepts, but I cannot use the raw pictures themselves since I have to process them first. 4 GB of my raw images turn into 20 GB when transferred to lmdb. @Prune – Apr 05 '17 at 09:41
-
You ignored my first question and replaced the second with an unmeasurable "best trade-off". This leaves me with nothing to add to the discussion. – Prune Apr 05 '17 at 16:26
-
Well, my dataset is already compressed. The question was how I can store the dataset in compressed form: when I use lmdb, my previously compressed dataset gets much bigger, since lmdb does not apply any compression itself. That is why I was a bit confused by your question. @Prune – Apr 09 '17 at 11:52
-
I'm a bit confused about yours: "Is there any way to compress an lmdb?" If you know that LMDB isn't compressed, but your data is already compressed ... that's where we lost contact. – Prune Apr 10 '17 at 16:18
-
Of course, you can tar & compress an LMDB directory -- but it doesn't do much for image files; I get about 8-10% compression. That's why I asked whether you required lossless compression; simply downsizing to a standard size saves a lot of space and some time. – Prune Apr 10 '17 at 16:19
-
Okay, thank you! Inman mentioned that one can compress the images in an lmdb. Take a look at his answer, if you are interested too! @Prune – Apr 11 '17 at 14:54