I've got a 150 GB h5py dataset that I want to shuffle.
In the post Shuffle HDF5 dataset using h5py, the user said it took 11 minutes to shuffle 30 GB of data. Scaling that linearly, my 150 GB dataset should take around 55 minutes, but when I tried shuffling it, it took an awful lot longer than that (I eventually had to cancel it).
Does the time not increase linearly with dataset size? How does random.shuffle work on an h5py dataset? Does it load one element at a time?
I'm not using chunking or any other special h5py settings. In case it helps: elements in the dataset have shape (8, 8, 21) and dtype="int32".
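For reference, the shuffle I ran looks roughly like this (the file and dataset names are placeholders, and I've shrunk the dataset to 100 elements so it runs quickly). My understanding, which may be wrong, is that random.shuffle does a Fisher-Yates shuffle through the dataset's __getitem__/__setitem__, so every swap is two reads plus two writes against the file on disk:

```python
import random

import h5py
import numpy as np

# Build a small stand-in dataset with the same element shape/dtype as mine.
with h5py.File("shuffle_demo.h5", "w") as f:
    data = np.arange(100 * 8 * 8 * 21, dtype="int32").reshape(100, 8, 8, 21)
    f.create_dataset("samples", data=data)

# Shuffle the dataset in place. Each swap performed by random.shuffle
# reads two elements from the file and writes two elements back.
with h5py.File("shuffle_demo.h5", "r+") as f:
    dset = f["samples"]
    random.shuffle(dset)
```

If that per-swap I/O is really what happens, I'd expect the cost to be dominated by disk access rather than dataset size alone, which might explain the non-linear wall-clock time.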