import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')
ls -sh test*
11M test.csv 16M test.h5
With a larger dataset the gap grows. Writing through an HDFStore, as below,
makes no difference.
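store = pd.HDFStore('test.h5')
# HDFStore holds pandas objects, so wrap the array in a DataFrame;
# the table layout is requested per object via format='table'
store.put('df', pd.DataFrame(np.zeros((1000000,1))), format='table')
store.close()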
Edit: Never mind, the example was a bad one. Using non-trivial numbers instead of zeros changes the picture.
from numpy.random import rand
import pandas as pd
df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')
ls -sh test*
260M test.csv 153M test.h5
Storing numbers as binary floats should take fewer bytes than storing them as text with one character per digit. That is generally true, but not in my first example, where every value was '0.0': those three characters are fewer than the 8 bytes of a 64-bit float, so the text representation came out smaller than the binary one.
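
The arithmetic can be checked without writing any files. The sketch below is a rough estimate, not a description of what pandas or HDF5 do internally. It assumes each CSV row is written as `index,value` plus a newline with the float in its shortest round-trip form, and that the HDF5 file stores one int64 index and one float64 value per row. It ignores the CSV header and HDF5 metadata. The function names are made up for illustration.

```python
import struct

# Assumption: one int64 index entry plus one float64 value per row.
BINARY_ROW_BYTES = struct.calcsize("<q") + struct.calcsize("<d")


def csv_row_bytes(index, value):
    """Bytes in one text row: index, comma, float repr, newline."""
    return len(f"{index},{value!r}\n".encode("ascii"))


def estimate_csv_bytes(n_rows, value):
    """Total text bytes if every row holds the same value (header ignored)."""
    return sum(csv_row_bytes(i, value) for i in range(n_rows))


def estimate_binary_bytes(n_rows):
    """Total binary bytes under the 16-bytes-per-row assumption."""
    return n_rows * BINARY_ROW_BYTES


if __name__ == "__main__":
    n = 1_000_000
    print("csv   :", estimate_csv_bytes(n, 0.0))
    print("binary:", estimate_binary_bytes(n))
```

For a million zeros this gives roughly 10.9 million bytes of text against 16 million bytes of binary data, which is consistent with the 11M and 16M reported by `ls` above. A random float usually prints with many more digits, so a text row typically exceeds 16 bytes and the binary file becomes the smaller one.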