Pandas has a method .to_hdf() to save a DataFrame as an HDF table. However, each time the command .to_hdf(path, key) is run, the size of the file on disk increases.
import os
import string

import numpy as np
import pandas as pd

size = 10**4
df = pd.DataFrame({"C": np.random.randint(0, 100, size),
                   "D": np.random.choice(list(string.ascii_lowercase), size=size)})

# Save the same DataFrame under the same key four times,
# printing the file size after each save.
for iteration in range(4):
    df.to_hdf("a_file.h5", "key1")
    print(os.path.getsize("a_file.h5"))
And the output clearly shows that the size of the file is increasing:
# 1240552
# 1262856
# 1285160
# 1307464
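For context, .to_hdf() opens the file in append mode by default (mode="a"). A minimal sketch of a comparison one could run, forcing the file to be rewritten from scratch on each save (this variant is my addition, not part of the test above):

# Hypothetical comparison: mode="w" truncates the file on each save
# instead of appending to the existing one (the default is mode="a").
for iteration in range(4):
    df.to_hdf("b_file.h5", "key1", mode="w")
    print(os.path.getsize("b_file.h5"))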
Since each call saves the same df under the same key, replacing the previous contents, I would expect the HDF file size to stay constant. The increase seems modest for a small df, but with larger dataframes it quickly leads to HDF files that are significantly bigger than the file produced by the first save.
Sizes I get with a dataframe of length 10**7 over 7 iterations:
29 MB, 48 MB, 67 MB, 86 MB, 105 MB, 125 MB, 144 MB
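That larger run is the same loop with a bigger frame; a minimal sketch, assuming the imports from the snippet above:

size = 10**7
df = pd.DataFrame({"C": np.random.randint(0, 100, size),
                   "D": np.random.choice(list(string.ascii_lowercase), size=size)})
# Start from a fresh file so the first printed size matches a single save.
if os.path.exists("a_file.h5"):
    os.remove("a_file.h5")
for iteration in range(7):
    df.to_hdf("a_file.h5", "key1")
    print(os.path.getsize("a_file.h5"))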
Why is the HDF file size not constant, and why does it increase with each new .to_hdf() call?