
Storing empty DataFrames using pandas HDFStore consumes a lot of disk space. Here's an example:

import numpy as np
import pandas as pd

# Write 1000 empty DataFrames, one key each
for i in range(1000):
    with pd.HDFStore('/tmp/test_empty.hdf') as s:
        key = 'E{:03d}'.format(i)
        s[key] = pd.DataFrame()

# For comparison, write 1000 DataFrames holding a single NaN
for i in range(1000):
    with pd.HDFStore('/tmp/test_nan.hdf') as s:
        key = 'N{:03d}'.format(i)
        s[key] = pd.DataFrame([np.nan])

The file sizes:

$ ls -lh /tmp/test_empty.hdf /tmp/test_nan.hdf
.... 2.0G Nov 11 11:47 /tmp/test_empty.hdf
.... 5.5M Nov 11 11:47 /tmp/test_nan.hdf

1000 DataFrames each containing a single NaN consume about 400 times less disk space than 1000 empty DataFrames. Is there a more space-efficient way to mark a key as taken in an HDFStore? (It's counter-intuitive that empty DataFrames consume so much space.)

1 Answer


I had the same problem and ended up adding one empty column to the DataFrame.

df = pd.DataFrame({'': []})  # one column (empty name), zero rows
df.to_hdf('file_name.hdf', key='key')

After loading, it is trivial to check whether the DataFrame is empty:

df_loaded = pd.read_hdf('file_name.hdf', 'key')
df_loaded.empty  # True
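
The same trick can be batched over many keys; here is a minimal sketch (the temp path and `K{:03d}` key pattern are placeholders, not from the original post) that writes the markers in one store session and checks them on read-back:

```python
import os
import tempfile

import pandas as pd

# Placeholder location for the store; any writable path works
path = os.path.join(tempfile.mkdtemp(), 'markers.hdf')

# Mark 100 keys with the single-empty-column DataFrame,
# opening the store once rather than once per key
with pd.HDFStore(path) as s:
    for i in range(100):
        s['K{:03d}'.format(i)] = pd.DataFrame({'': []})

# Each marker reads back as an empty (zero-row) DataFrame
with pd.HDFStore(path) as s:
    loaded_empty = [s[k].empty for k in s.keys()]

print(len(loaded_empty), all(loaded_empty))
```

Opening the store once for the whole batch also avoids the repeated open/close cycles in the question's loops, which contribute to file growth.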