I have a CSV file (containing only numeric data) that is 18 MB in size. When I read it, convert it to a NumPy array, and save it in HDF5 or pickle format, it takes around 48 MB of disk space. Shouldn't the data be compressed when we use pickle or HDF5? Is it better to save it in HDF5 format for consumption by TensorFlow? The CSV data has the form:

2,3,66,184,2037,43312,0,0,9,2,0,1,8745,1,0,2,6,204,27,97
2,3,66,184,2037,43312,0,0,9,2,0,1,8745,1,0,2,6,204,27,78
2,3,66,184,2037,43312,0,0,9,2,0,1,8745,1,0,1,6,204,27,58

The dimensions of the data are 310584 x 20.

Himaprasoon

1 Answer

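NumPy arrays of integers default to the int64 dtype (on most 64-bit platforms). That is why the data took more space than the original CSV: every value is stored in 8 bytes, regardless of how few digits it has in the text file.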

310584 x 20 x 8 bytes = 49,693,440 bytes, or about 47.4 MiB, in line with the ~48 MB observed (8 bytes is the size of one int64 value).
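
The arithmetic above can be checked with a minimal sketch like the following. It uses a zero-filled stand-in array rather than the real file, and it assumes every value fits in int32, which is true of the sample rows shown in the question but not verified for the whole dataset. The compressed save at the end is only an illustration that compression has to be requested explicitly; the size it reports for this stand-in says nothing about the real data.

```python
import io

import numpy as np

ROWS, COLS = 310584, 20  # shape given in the question

# Stand-in for the parsed CSV (assumption: the real values are not available here).
data64 = np.zeros((ROWS, COLS), dtype=np.int64)
data64[:, 5] = 43312  # largest value seen in the sample rows

bytes_int64 = data64.nbytes  # 8 bytes per element

# Downcast only after checking that every value fits in the smaller type.
limits = np.iinfo(np.int32)
if data64.min() < limits.min or data64.max() > limits.max:
    raise ValueError("values do not fit in int32")
data32 = data64.astype(np.int32)
bytes_int32 = data32.nbytes  # 4 bytes per element

# Compression is opt-in: np.savez_compressed writes a zip-compressed archive.
buffer = io.BytesIO()
np.savez_compressed(buffer, data=data32)
bytes_compressed = buffer.getbuffer().nbytes

print(bytes_int64, bytes_int32, bytes_compressed)
```

Passing an explicit dtype when the array is first created avoids the int64 default altogether, so the intermediate 8-byte copy is never built.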
Himaprasoon