So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as dating from the 1750s, etc.).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like (a rough code sketch follows the list):
- Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't,
- Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
- Run the associated query on the external database for that table. Each one is between 100 and 1,500 columns of daily data going back to 1980.
- Store the result of query 1 as the new "QUERY1" in the HDF5 store
- Compute several transformations of one or more of QUERY1, QUERY2, ..., QUERYn, which will have hierarchical (Pandas MultiIndex) columns. Overwrite each item ("Derived_Frame1", etc.) with its update/replacement in the HDF5 store
- Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- occasionally, though not usually, at the same time.
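To make the cycle concrete, here's a minimal sketch of what the routine does (not the actual code -- `is_up_to_date`, `run_external_query`, `build_derived`, and the path are placeholders for the real logic):

```python
from datetime import date

import pandas as pd

STORE_PATH = r"\\server\share\data.h5"  # placeholder path on the network drive

def refresh_query(key):
    with pd.HDFStore(STORE_PATH) as store:
        current = store[key]                        # e.g. "QUERY1"
        if is_up_to_date(current):                  # placeholder staleness check
            return
        # archive the old table under a dated key, e.g. "QUERY1_2015_06_01"
        store.put(key + date.today().strftime("_%Y_%m_%d"), current)
        # re-run the query against the external database and overwrite the live key
        store.put(key, run_external_query(key))     # placeholder query runner

def rebuild_derived():
    with pd.HDFStore(STORE_PATH) as store:
        # transformations of QUERY1..QUERYn, producing MultiIndex columns
        derived = build_derived(store["QUERY1"], store["QUERY2"])
        store.put("Derived_Frame1", derived)        # overwrite in place
```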
Some things I suspect could be part of the problem:
- Using the default format (`df.to_hdf(store, key)`) instead of insisting on "table" format with `df.to_hdf(store, key, format='table')`. I do this because the default format is between 2 and 5x faster on both read and write according to `%timeit` (see the sketch after this list for the two calls side by side).
- Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
- From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at the very least gradually increasing the file size; I'm not sure what else). Maybe I should be storing the query archives in a separate file? Maybe I should be nuking and replacing the whole main file on each update? (See the `ptrepack` sketch at the end for one way to reclaim the space.)
- Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although the warning seems to go away if I use `format='table'`.
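For concreteness, here's a minimal sketch of the two write paths, using a small random frame with MultiIndex columns as a stand-in for the real data (the 2-5x figure came from `%timeit` on calls like these):

```python
import numpy as np
import pandas as pd

# Small stand-in for a derived frame with hierarchical (MultiIndex) columns
cols = pd.MultiIndex.from_product([["grp1", "grp2"], ["a", "b"]])
df = pd.DataFrame(np.random.randn(10, 4), columns=cols)

# Default ("fixed") format -- faster, but this is where I see the warning
df.to_hdf("scratch.h5", "fixed_key")

# "table" format -- slower, but the warning goes away
df.to_hdf("scratch.h5", "table_key", format="table")
```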
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
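On the file-size point: from what I can tell, HDF5 never reclaims the space freed by deleted or overwritten nodes, which would explain the steady growth, and the suggested fix is to periodically copy the store into a fresh file with `ptrepack`, the utility that ships with PyTables. Is something like this sketch (placeholder filenames) the right approach here?

```python
import subprocess

# Copy every node into a fresh file; the copy leaves behind the dead space
# accumulated by repeated delete/re-write cycles, so the new file is compact.
subprocess.check_call([
    "ptrepack", "--chunkshape=auto", "--propindexes",
    "data.h5", "data_compacted.h5",  # placeholder filenames
])
```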
Any ideas?