I have an HDF5 file with a dataset of shape (80000, 401408). I need to read data from it in batches of size 64, but the indices can be random, say (5, 0, 121, .., 2).

The problem is that while the reads are quite consistent initially and a batch takes, say, 0.5 seconds to complete, after a while some batches take longer, up to 10 seconds, while other batches are still read quickly. I have observed that as more and more reads take place, the reading process slows down.

import h5py
import numpy as np

hf = h5py.File(conv_file, 'r')          # the file is opened only once
conv_features = hf['conv_features']     # dataset of shape (80000, 401408)
while True:
    conv_batch = [None for i in range(64)]
    for i in range(64):
        # some_random_index is a random row index into the dataset
        conv_batch[i] = np.reshape(conv_features[some_random_index], [14, 14, 2048])
    # The time for each of the above reads of conv_batch varies from 0.5 to 5 seconds and slows down over time.

I am not using chunks.

  • Sure you're not running out of memory and start working on your swapping device or something like that? – Stefan Falk Oct 02 '17 at 22:14
  • No, I am not running out of memory. The process is just getting slower over time. I am reusing the same variables and only reading the reference to the hdf5 file once. – user3682478 Oct 02 '17 at 22:19
  • 1
    Not quite sure if one can determine the problem from the example you're showing there. I'm also not sure why you're using the `deep-learning` tag here. Maybe post your actual code and not just this small excerpt. – Stefan Falk Oct 02 '17 at 22:33
  • displayname is right, it is hard to diagnose this problem if we don't have enough information to reproduce it. – Paulo Scardine Oct 02 '17 at 22:42
  • I wanted to know if it is usual for HDF5 to have variable read times (because of the random index access). The time to read 64 vectors spans from 0.5 to 5 seconds and, as I mentioned, it slows down over time. Apart from the above, there is very little that has anything to do with hdf5. – user3682478 Oct 02 '17 at 22:52
  • The HDF5 library caches metadata and data about open HDF5 files. Have you tried closing/opening the file for every access? It is worth timing this approach (see the sketch after this comment thread). – Pierre de Buyl Oct 02 '17 at 23:12
  • `h5py` 'fancy-indexing' warns that indexing performance may be poor. http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing. Remember that a disk file is inherently a serial device, so reading from two widely separated 'rows' requires large file `seek` moves. – hpaulj Oct 03 '17 at 00:14
  • Is the file compressed? – kakk11 Oct 03 '17 at 06:16
  • @kakk11 No the file is not compressed. – user3682478 Oct 03 '17 at 22:44
  • @hpaulj I see. The data is on an SSD so thought that should not be a big issue. I am more concerned with why the random access is way faster in some cases than others. – user3682478 Oct 03 '17 at 22:45
  • @PierredeBuyl Thanks for the suggestion! I will try this and let you know if it worked. – user3682478 Oct 03 '17 at 22:45
  • I tried it with a smaller dataset (3000, 401408) and could not reproduce your problem (Win64, Python 2.7, newest h5py version available in Anaconda). The read speed at the beginning is approximately the sequential read speed of my SSD and becomes faster because some data is cached in RAM. Could you provide more information (Python version, h5py version, operating system)? – max9111 Oct 09 '17 at 11:22
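
A rough way to time the close/reopen suggestion from the comments above (a minimal sketch; the file path, dataset name, and loop count are assumptions based on the question):

import time
import numpy as np
import h5py

conv_file = 'conv_features.h5'   # hypothetical path, standing in for the file from the question
batch_size = 64
n_rows = 80000

for step in range(100):
    t0 = time.time()
    # Reopen the file for every batch so the HDF5 library's internal caches are reset.
    with h5py.File(conv_file, 'r') as hf:
        conv_features = hf['conv_features']
        indices = np.random.randint(0, n_rows, size=batch_size)
        conv_batch = [np.reshape(conv_features[i], [14, 14, 2048]) for i in indices]
    print('batch %d took %.2f s' % (step, time.time() - t0))

If the per-batch time stays flat with this version but creeps up with a single long-lived file handle, that points at cache growth inside the HDF5 library rather than at the disk itself.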

1 Answer


Have you tried controlling the chunk size of your dataset? Set the chunk size to a reasonable, commonly accessed portion of the data.

E.g. if you commonly access your (80000, 401408) data row by row, reads would be efficient with chunks of (1, 401408) or perhaps (1, 200704).

Depending on your access pattern, the chunk size can have a huge effect on access time. You can also consider using compression.
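
For example, the dataset could be rewritten with one row per chunk (a minimal sketch; the file names are illustrative assumptions, and the row-wise chunk shape follows the access pattern described in the question):

import h5py

src = h5py.File('conv_features.h5', 'r')            # hypothetical source file
dst = h5py.File('conv_features_chunked.h5', 'w')    # hypothetical destination

features = src['conv_features']                     # shape (80000, 401408)

# One row per chunk, so a random row read touches exactly one chunk.
chunked = dst.create_dataset(
    'conv_features',
    shape=features.shape,
    dtype=features.dtype,
    chunks=(1, features.shape[1]),                  # (1, 401408)
    # compression='gzip',                           # optional, trades CPU time for I/O
)

# Copy in row blocks to keep memory use bounded.
block = 64
for start in range(0, features.shape[0], block):
    chunked[start:start + block] = features[start:start + block]

src.close()
dst.close()

With one row per chunk, each random row read maps to exactly one chunk, so a 64-row batch costs at most 64 chunk reads regardless of how scattered the indices are.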

Marcus Lim