I have a large HDF5 file (~30GB) and I need to shuffle the entries (along the 0 axis) in each dataset. Looking through the h5py docs I wasn't able to find either randomAccess or shuffle functionality, but I'm hoping that I've missed something.
Is anyone familiar enough with HDF5 to think of a fast way to random shuffle the data?
Here is pseudocode of what I would implement with my limited knowledge:
for dataset in datasets:
    unshuffled = range(dataset.dims[0])
    while unshuffled.length != 0:
        if unshuffled.length <= 100:
            # too few rows left to pick two disjoint runs: swap the halves and stop
            half = unshuffled.length / 2
            dataset[:half], dataset[half:] = dataset[half:], dataset[:half]
            break
        else:
            # pick two random runs of 100 rows, mark them shuffled, and swap them
            randomIndex1 = rand(unshuffled.length - 100)
            randomIndex2 = rand(unshuffled.length - 100)
            unshuffled.removeRange(randomIndex1 ..< randomIndex1 + 100)
            unshuffled.removeRange(randomIndex2 ..< randomIndex2 + 100)
            dataset[randomIndex1 : randomIndex1 + 100], dataset[randomIndex2 : randomIndex2 + 100] = \
                dataset[randomIndex2 : randomIndex2 + 100], dataset[randomIndex1 : randomIndex1 + 100]
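For reference, here is a minimal runnable sketch of the block-swap idea above. It uses a NumPy array to stand in for an h5py dataset (both accept the same row-slice reads and writes, so `dset` could equally be an `h5py.Dataset` opened from the file); the function name `blockwise_shuffle` and the fixed block size are my own choices, and note this is an approximate shuffle, not a uniform Fisher-Yates one.

```python
import numpy as np

def blockwise_shuffle(dset, block=100, seed=0):
    """Approximately shuffle rows of `dset` in place by swapping random,
    non-overlapping blocks of `block` rows.

    `dset` may be a NumPy array or (assumed) an h5py dataset, since only
    contiguous slice reads/writes along axis 0 are used. Each swap touches
    just 2 * block rows, so memory use stays small even for a ~30GB file.
    """
    rng = np.random.default_rng(seed)
    n_blocks = dset.shape[0] // block
    # Pair up the blocks in a random order and swap each pair once.
    order = rng.permutation(n_blocks)
    for a, b in zip(order[0::2], order[1::2]):
        i, j = a * block, b * block
        tmp = dset[i:i + block].copy()          # read block A into memory
        dset[i:i + block] = dset[j:j + block]   # overwrite A with B
        dset[j:j + block] = tmp                 # write A's old rows into B
```

With h5py this would be called as `blockwise_shuffle(f["some_dataset"])` on a file opened in `"r+"` mode; running the loop for several passes (or with smaller blocks) mixes the rows more thoroughly at the cost of more I/O.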