HDF5 and pickle take more space than the raw CSV file

Question

I have a CSV file (containing only numeric data) of size 18 MB. When I read it, convert it to a NumPy array, and save it in HDF5 or pickle format, it takes around 48 MB of disk space. Shouldn't the data be compressed when we use pickle or HDF5? Is it better to save in HDF5 format for consumption by TensorFlow? The CSV data is of the form:

2,3,66,184,2037,43312,0,0,9,2,0,1,8745,1,0,2,6,204,27,97
2,3,66,184,2037,43312,0,0,9,2,0,1,8745,1,0,2,6,204,27,78
2,3,66,184,2037,43312,0,0,9,2,0,1,8745,1,0,1,6,204,27,58

The dimensions of the data are 310584 x 20.

So the HDF5 and pickle both take around 48MB of disk space? What is the dimension and type of the dataset? Also if you can post a few lines of the csv that might be helpful. — John Readey, Jun 27 '16 at 17:04

Himaprasoon · Accepted Answer · 2016-06-28T07:50:52.587

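A NumPy array of integers defaults to the int64 dtype on most 64-bit platforms, so every value occupies 8 bytes no matter how few digits it has in the CSV. This is why the saved data took more space than the original file. Note also that pickle does not compress at all, and HDF5 compresses only when a compression filter is requested for the dataset.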

310584 x 20 x 8 bytes = 49,693,440 bytes (about 47.4 MiB), which matches the roughly 48 MB observed (8 bytes is the size of one int64 value)
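The arithmetic above can be checked with a short sketch. It assumes stand-in data (zeros plus the largest value from the sample rows, 43312) rather than the real CSV, and the names `data64`, `data32`, and `pickled64` are illustrative only. It also shows one possible remedy, downcasting to int32, which is only safe after checking that every value fits.

```python
import pickle

import numpy as np

ROWS, COLS = 310584, 20  # shape reported in the question

# Stand-in for the parsed CSV (assumption): zeros plus the largest
# value that appears in the sample rows.
data64 = np.zeros((ROWS, COLS), dtype=np.int64)
data64[:, 5] = 43312

# In-memory size: ROWS * COLS * 8 bytes for int64.
size64 = data64.nbytes

# pickle stores the raw buffer plus a small header, with no compression.
pickled64 = len(pickle.dumps(data64, protocol=pickle.HIGHEST_PROTOCOL))

# Downcast only after confirming that every value fits in int32.
info32 = np.iinfo(np.int32)
assert info32.min <= data64.min() and data64.max() <= info32.max
data32 = data64.astype(np.int32)
size32 = data32.nbytes  # ROWS * COLS * 4 bytes

print(f"int64 array : {size64:,} bytes")
print(f"int64 pickle: {pickled64:,} bytes")
print(f"int32 array : {size32:,} bytes")
```

Halving the item size halves the stored size. If the real data has a narrower range, a smaller dtype may work too, but that depends on values not shown in the question.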

edited Jun 28 '16 at 07:50

answered Jun 28 '16 at 06:35
