So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as dating from the 1750s, etc.).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like (a rough code sketch follows the list):
- Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't,
- Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
- Run the associated query on the external database for that table. Each one is between 100 and 1,500 columns of daily data going back to 1980.
- Store the result of query 1 as the new "QUERY1" in the HDF5 store
- Compute several transformations of one or more of QUERY1, QUERY2, ..., QUERYn, which will have hierarchical (Pandas MultiIndex) columns. Overwrite each item ("Derived_Frame1", etc.) with its update/replacement in the HDF5 store
- Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- occasionally, though not usually, at the same time.
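To make the cycle concrete, here's a minimal sketch of what the routine does (not the actual code -- `is_up_to_date`, `run_external_query`, `build_derived`, and the path are placeholders for the real logic):

```python
from datetime import date

import pandas as pd

STORE_PATH = r"\\server\share\data.h5"  # placeholder path on the network drive

def refresh_query(key):
    with pd.HDFStore(STORE_PATH) as store:
        current = store[key]                        # e.g. "QUERY1"
        if is_up_to_date(current):                  # placeholder staleness check
            return
        # archive the old table under a dated key, e.g. "QUERY1_2015_06_01"
        store.put(key + date.today().strftime("_%Y_%m_%d"), current)
        # re-run the query against the external database and overwrite the live key
        store.put(key, run_external_query(key))     # placeholder query runner

def rebuild_derived():
    with pd.HDFStore(STORE_PATH) as store:
        # transformations of QUERY1..QUERYn, producing MultiIndex columns
        derived = build_derived(store["QUERY1"], store["QUERY2"])
        store.put("Derived_Frame1", derived)        # overwrite in place
```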
Some things I suspect could be part of the problem:
- Using the default format (`df.to_hdf(store, key)`) instead of insisting on "table" format with `df.to_hdf(store, key, format='table')`. I do this because the default format is between 2 and 5x faster on both read and write according to `%timeit` (see the sketch after this list for the two calls side by side).
- Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
- From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at the very least gradually increasing the file size; I'm not sure what else). Maybe I should be storing the query archives in a separate file? Maybe I should be nuking and replacing the whole main file on each update? (See the `ptrepack` sketch at the end for one way to reclaim the space.)
- Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although the warning seems to go away if I use `format='table'`.
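For concreteness, here's a minimal sketch of the two write paths, using a small random frame with MultiIndex columns as a stand-in for the real data (the 2-5x figure came from `%timeit` on calls like these):

```python
import numpy as np
import pandas as pd

# Small stand-in for a derived frame with hierarchical (MultiIndex) columns
cols = pd.MultiIndex.from_product([["grp1", "grp2"], ["a", "b"]])
df = pd.DataFrame(np.random.randn(10, 4), columns=cols)

# Default ("fixed") format -- faster, but this is where I see the warning
df.to_hdf("scratch.h5", "fixed_key")

# "table" format -- slower, but the warning goes away
df.to_hdf("scratch.h5", "table_key", format="table")
```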
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
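On the file-size point: from what I can tell, HDF5 never reclaims the space freed by deleted or overwritten nodes, which would explain the steady growth, and the suggested fix is to periodically copy the store into a fresh file with `ptrepack`, the utility that ships with PyTables. Is something like this sketch (placeholder filenames) the right approach here?

```python
import subprocess

# Copy every node into a fresh file; the copy leaves behind the dead space
# accumulated by repeated delete/re-write cycles, so the new file is compact.
subprocess.check_call([
    "ptrepack", "--chunkshape=auto", "--propindexes",
    "data.h5", "data_compacted.h5",  # placeholder filenames
])
```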
Any ideas?