
I've got a 150GB h5py dataset that I want to shuffle.

In this post, Shuffle HDF5 dataset using h5py, the user said it took 11 minutes to shuffle a 30GB dataset. However, when I tried shuffling my dataset it took an awful lot longer than the roughly 55 minutes that linear scaling would suggest (I eventually had to cancel it).

Does the time not increase linearly with dataset size? How does `random.shuffle` work on a dataset? Does it load single elements at a time?

I'm not using chunking or any other special h5py settings. Elements of the dataset have shape `(8, 8, 21)` and `dtype="int32"`, if that helps.
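
For context, the shuffle I'm attempting is essentially just this (a minimal sketch; the file and dataset names are placeholders):

```python
import random
import h5py

# Placeholder file/dataset names; the real file is ~150GB on disk.
with h5py.File("data.h5", "r+") as f:
    dset = f["dataset"]      # shape (N, 8, 8, 21), dtype int32, not chunked
    random.shuffle(dset)     # shuffles along the first axis, in place
```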

Charlie
  • `random.shuffle` is written in Python. Looking at its source, the core is a repeated `x[i], x[j] = x[j], x[i]`, where `j` is a random index no larger than `i` (see the sketch below these comments). Note that in your link none of the answers were accepted by the OP. – hpaulj Jul 19 '19 at 15:32
  • Note that the timing benchmark in the linked post is for a file on an SSD drive. I/O access times on a mechanical hard drive (HDD) will be substantially slower. – kcw78 Jul 19 '19 at 18:58
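
For reference, the loop inside `random.shuffle` is a Fisher-Yates shuffle roughly along these lines (a simplified sketch, not the exact stdlib source); spelling it out against an h5py dataset makes the per-element I/O explicit:

```python
import random

def shuffle_like_random_shuffle(dset, rng=random):
    """Roughly what random.shuffle does to a mutable sequence, applied to an
    h5py dataset: every swap is four small, separate HDF5 operations."""
    n = len(dset)
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)       # j is a random index in [0, i]
        xi, xj = dset[i], dset[j]      # two reads: each copies one (8, 8, 21) element
        dset[i], dset[j] = xj, xi      # two writes back to the file
```

Nothing in that loop is vectorised: for N elements it issues on the order of 2N tiny reads and 2N tiny writes through HDF5, which is where the time goes, especially on a spinning disk.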

0 Answers