Access HDF files stored on s3 in pandas

Question

I'm storing pandas data frames dumped in HDF format on S3. I'm pretty much stuck as I can't pass the file pointer, the URL, the s3 URL or a StringIO object to read_hdf. If I understand it correctly the file must be present on the filesystem.

Source: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L315

It looks like it's implemented for CSV but not for HDF. Is there any better way to open those HDF files than copying them to the filesystem?

For the record, these HDF files are being handled on a web server, that's why I don't want a local copy.

If I need to stick with the local file: Is there any way to emulate that file on the filesystem (with a real path) which can be destroyed after the reading is done?

I'm using Python 2.7 with Django 1.9 and pandas 0.18.1.

As far as I know, S3 does not easily allow random access to files, which would be needed to read single chunks from an HDF5 file. But I'd be glad to be wrong. — kakk11, Sep 09 '16 at 09:03
Yeah, you're right. As I continued my research, it turned out that HDF doesn't really suit our use case, as we wanted to store a separate HDF file per data frame. We switched back to good old pickle files; they do their job pretty well. — fodma1, Sep 09 '16 at 09:07

score 4 · Answer 1 · answered Oct 24 '19 at 16:23

Newer versions of pandas can read an HDF5 file directly from S3, as mentioned in the read_hdf documentation, so you should upgrade pandas if you can. This of course assumes you've set up the access rights needed to read those files, either with a credentials file or with public ACLs.

Regarding your last comment, I am not sure why storing a separate HDF5 file per data frame would rule out using HDF5. Pickle should be much slower than HDF5, though joblib.dump might partially make up for this.
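
If upgrading is not an option, the fallback raised in the question (a real path on the filesystem that is destroyed once reading is done) can be sketched with the standard library alone. This is a minimal sketch, not a tested recipe for your setup: temp_file_from_bytes is a hypothetical helper name, and the boto3 call, bucket and key in the usage comment are placeholders I am assuming for illustration.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temp_file_from_bytes(data, suffix=".h5"):
    """Write `data` to a real temporary file, yield its path, then delete it."""
    # mkstemp returns an open file descriptor and an absolute path.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Write and close the file before handing the path out,
        # so a reader can open it by name.
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Remove the file even if the caller's code raised.
        os.remove(path)


# Hypothetical usage (assumed bucket and key names, requires boto3 and pandas):
#   import boto3
#   import pandas as pd
#   body = boto3.client("s3").get_object(Bucket="my-bucket", Key="frame.h5")["Body"].read()
#   with temp_file_from_bytes(body) as path:
#       df = pd.read_hdf(path)
```

The file only exists for the duration of the with block, so nothing is left behind on the web server after the data frame has been loaded into memory.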

Thanks for the answer, this is really great news, but I have no chance to try — fodma1, Oct 24 '19 at 19:41
