Storing empty DataFrames in a pandas HDFStore consumes a surprising amount of disk space. Here's an example:
import numpy as np
import pandas as pd

# 1000 keys, each holding an empty DataFrame
for i in range(1000):
    with pd.HDFStore('/tmp/test_empty.hdf') as s:
        key = 'E{:03d}'.format(i)
        s[key] = pd.DataFrame()

# 1000 keys, each holding a DataFrame with a single NaN
for i in range(1000):
    with pd.HDFStore('/tmp/test_nan.hdf') as s:
        key = 'N{:03d}'.format(i)
        s[key] = pd.DataFrame([np.nan])
The file sizes:
$ ls -lh /tmp/test_empty.hdf /tmp/test_nan.hdf
.... 2.0G Nov 11 11:47 /tmp/test_empty.hdf
.... 5.5M Nov 11 11:47 /tmp/test_nan.hdf
The 1000 DataFrames containing a single NaN consume roughly 370 times less space than the 1000 empty DataFrames. Is there a more efficient way to mark a key as taken in an HDFStore? (It's counter-intuitive that empty DataFrames consume so much space.)
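For context, all I really need is a cheap placeholder so that a simple membership check works later. A minimal sketch of that pattern, using a single-value Series as the placeholder (the file name and the choice of placeholder are just illustrative; whether this is actually the most space-efficient option is exactly what I'm unsure about):

import numpy as np
import pandas as pd

PLACEHOLDER = pd.Series([np.nan])  # smallest object I could think of to reserve a key

with pd.HDFStore('/tmp/test_placeholder.hdf') as s:
    key = 'E000'
    if key not in s:          # HDFStore supports `in` for key membership
        s[key] = PLACEHOLDER  # mark the key as taken
    print(s.keys())           # -> ['/E000']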