1

Pandas has a method .to_hdf() to save a DataFrame as an HDF table. However, each time .to_hdf(path, key) is run, the size of the file increases.

import os
import string
import pandas as pd
import numpy as np

size = 10**4
df = pd.DataFrame({"C": np.random.randint(0, 100, size),
                   "D": np.random.choice(list(string.ascii_lowercase), size=size)})
for iteration in range(4):
    df.to_hdf("a_file.h5","key1")
    print(os.path.getsize("a_file.h5"))

And the output clearly shows that the size of the file is increasing:

# 1240552
# 1262856
# 1285160
# 1307464

Since a new df is saved over the same key each time, the hdf file size should stay constant.

While the increase seems quite modest for a small df, with a larger df it quickly leads to hdf files that are significantly bigger than the file produced by the first save.

Sizes I get with a 10**7 long dataframe after 7 iterations:

29MB, 48MB, 67MB, 86MB, 105MB, 125MB, 144MB

Why is the hdf file size not constant, increasing with each new to_hdf() call?

Adrien Pacifico
    in `pd.DataFrame.to_hdf()` the default `mode` value is `'a'` which means append. Try `df.to_hdf("a_file.h5","key1", mode='w')` – jeschwar Feb 26 '19 at 20:35

1 Answer

2

This behavior is not easy to find if you only skim the documentation (which is 2973 PDF pages long), but it can be found in issue #1643 and in the warning in the IO Tools section of the documentation (delete from a table subsection): if you do not specify anything, the default writing mode is 'a' (append), which is what a simple df.to_hdf('a_path.h5', 'a_key') uses, and it will nearly double the size of your hdf file each time you run your script.

The solution is to use the write mode: df.to_hdf('a_path.h5', 'a_key', mode='w').
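For example, here is the same loop as in the question, but with mode='w' so the file is truncated and rewritten on each call instead of appended to (a sketch; it assumes pandas with the pytables dependency installed):

```python
import os
import string
import numpy as np
import pandas as pd

size = 10**4
df = pd.DataFrame({"C": np.random.randint(0, 100, size),
                   "D": np.random.choice(list(string.ascii_lowercase), size=size)})

sizes = []
for iteration in range(4):
    # mode='w' truncates the file first, so old copies of the key
    # do not accumulate inside it
    df.to_hdf("a_file.h5", "key1", mode="w")
    sizes.append(os.path.getsize("a_file.h5"))

print(sizes)  # all four sizes are identical
```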

However, this behavior only happens with the fixed format (which is the default format), not with the table format (unless append is set to True).
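A small sketch contrasting the two cases with format='table' (assuming pandas with pytables installed): with the default append=False, rewriting the same key replaces its contents, while append=True makes the rows accumulate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1000)})

# append=False (the default): each call replaces the key's contents
for _ in range(3):
    df.to_hdf("table_file.h5", "k", format="table")
print(len(pd.read_hdf("table_file.h5", "k")))  # 1000

# append=True: each call adds the rows again
for _ in range(3):
    df.to_hdf("append_file.h5", "k", format="table", append=True)
print(len(pd.read_hdf("append_file.h5", "k")))  # 3000
```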

Adrien Pacifico