I am using the following combination of h5py and s3fs to read a couple of small datasets from larger HDF5 files on Amazon S3.

import h5py
import s3fs

s3 = s3fs.S3FileSystem()
h5_file = h5py.File(s3.open(s3_path, 'rb'), 'r')
data = h5_file.get(dataset)

These reads are relatively slow: reading a single dataset this way takes about as long as copying the entire file from the S3 bucket to local disk and then reading the dataset there. I assume the reason is the overhead of the many seek and read calls that h5py issues through s3fs.

Does anyone have an idea for a faster approach? (Apart from downloading the file and then reading it, which is faster if I want to read multiple datasets but still far too slow.)

Thanks!

Emmanuel

  • The bottleneck is NOT in h5py. Getting file and dataset objects are fast operations (and not a function of file or dataset size). If this file was local, the 2 lines would execute almost instantly. The bottleneck is either in `s3.open()` or due to network performance from your computer to Amazon S3 server. – kcw78 Apr 29 '21 at 21:51
  • @kcw78 Thanks! I managed to achieve a significant speedup by using the ros3 driver in h5py rather than going via s3fs. Query time goes from about 12 seconds to just 300 ms when reading a small dataset from a 150 MB HDF5 file on S3. The only problem is that ros3 seems to be able to read only public URLs (or it doesn't read my .aws/credentials properly), which is the next headache to solve: ```file=h5py.File(s3_url,'r',driver='ros3')``` – Emmanuel Wildiers Apr 30 '21 at 23:30
  • I am not familiar with the ros3 driver. Did you include the `secret_id=` and `secret_key=` parameters? If not, give that a try. If that doesn't solve the problem, I suggest posting your question on the h5py forum (hosted by The HDF Group). [h5py on HDF Forum](https://forum.hdfgroup.org/c/hdf-tools/h5py) – kcw78 May 01 '21 at 13:12
  • Thanks @kcw78, I had indeed included the authentication parameters, but that didn't help. I managed to figure out in the end what the problem was: not the authentication itself, but rather the URL format. The format bucket.amazon/file was apparently not working properly, while the format amazon/bucket/file did work. No idea why, and it was not easy to figure out because the error messages are very vague (just a general cURL error). Anyway, I hope people having the same problem in the future can be helped with this :-) – Emmanuel Wildiers May 03 '21 at 12:16
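
The approach worked out in the comments above can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: the helper names `path_style_s3_url` and `read_dataset_ros3` are hypothetical, the `https://s3.<region>.amazonaws.com/<bucket>/<key>` layout is an assumption about what "amazon/bucket/file" meant, and it assumes an h5py build with ros3 support. Credentials are passed as bytes because older h5py releases expect that; newer ones may also accept `str`.

```python
from urllib.parse import quote


def path_style_s3_url(bucket, key, region="us-east-1"):
    """Build a path-style URL (host first, then bucket, then object key).

    Hypothetical helper; the exact host name is an assumption.
    """
    return f"https://s3.{region}.amazonaws.com/{bucket}/{quote(key.lstrip('/'))}"


def read_dataset_ros3(bucket, key, dataset, region="us-east-1",
                      secret_id=None, secret_key=None):
    """Read one dataset into memory through h5py's ros3 driver."""
    import h5py  # imported here: requires an h5py/HDF5 build with ros3 enabled

    kwargs = {}
    if secret_id and secret_key:
        # ros3 authentication keywords; omitted entirely for public objects.
        kwargs = dict(
            aws_region=region.encode(),
            secret_id=secret_id.encode(),
            secret_key=secret_key.encode(),
        )
    url = path_style_s3_url(bucket, key, region)
    with h5py.File(url, "r", driver="ros3", **kwargs) as f:
        return f[dataset][()]  # [()] copies the dataset contents out of the file
```

A call might look like `read_dataset_ros3("my-bucket", "data/file.h5", "group/values", region="eu-west-1", secret_id=..., secret_key=...)`, where the bucket, key, and dataset names are placeholders.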
