Will Pandas auto sort new data appended to a dataframe (in hdf5 format)

Question

For example, I got a df1, index from 20200101 to 20210101, dtype is DateTime, and I already save this df to h5 (format is table). Then I got a new df2, index from 20200101 to 20200201, which is in the middle of the index of df1. If I append df2 to this h5, will it auto sort index?

It would be helpful to see the existing workflow. Do you have a simple example to share? (Likely writing the example will answer your question.) In regards to HDF5, it doesn't have any auto-sort capabilities. Pandas might be able to sort a dataframe. You would have to merge df1 and df2 and save the merged, sorted df as a new key (dataset) in HDF5. — kcw78, Jan 16 '22 at 17:16
Review this SO Q&A. [Append data to HDF5 file with Pandas, Python](https://stackoverflow.com/q/46206125/10462884). The last sentence of the second answer addresses your question: "_HDF5 won't do anything with the dataframe's indexes. We need to iron those out before putting the data in or when we take it out._" — kcw78, Jan 16 '22 at 17:34

score 0 · Accepted Answer · answered Jan 16 '22 at 18:10

It's easy enough to write an example to demonstrate the behavior and the solution.

Code below creates 2 dataframes of odd and even numbers and writes them to an HDF5 file in Table format as key='test_data'. (Table format required to append data.) When you run that, you will see that the odd numbers are first, followed by the even numbers (NOT auto-sorted).

col2 = [ x for x in range(1,53,2)]
col3 = [ float(x) for x in range(1,53,2)]

df1 = pd.DataFrame({'col_int': col2, 'col_float': col3})
df1.to_hdf('SO_70731279.h5',key='test_data',mode='w', format='table',
           data_columns=True)

col2 = [ x for x in range(2,54,2)]
col3 = [ float(x) for x in range(2,54,2)]

df2 = pd.DataFrame({'col_int': col2, 'col_float': col3})
df2.to_hdf('SO_70731279.h5',key='test_data',mode='a', format='table',
           data_columns=True, append=True)

This code reads the data from key='test_data' to df3, sorts the dataframe in-place, then writes to the HDF5 file with key='sorted_data'. Run this and you will see this data in the HDF5 file is in ascending order.

df3 = pd.read_hdf('SO_70731279.h5',key='test_data',columns=['col_int','col_float'])
df3.sort_values('col_int',inplace=True,ignore_index=True)
df3.to_hdf('SO_70731279.h5',key='sorted_data',mode='a', format='table',
           data_columns=True)

If you want to REPLACE the data in key='test_data', you have to write df3 to the key and not include append=True. This will overwrite the key.

df3.to_hdf('SO_70731279.h5',key='test_data',mode='a', format='table',
           data_columns=True)  ## Note append=True is missing on this call

This solution will solve your problem. However, computational efficiency with very large dataframes is unknown. You should test with your data before implementing. Good luck.

Thanks a lot. I see. Appending to h5 would not auto-sort index. Yes, as you mentioned, it would take a lot of time to read to df, sort using pandas dataframe, re-write into h5, if the database is big. — Chauncey, Jan 17 '22 at 08:28

Will Pandas auto sort new data appended to a dataframe (in hdf5 format)

1 Answers1