
I have a large HDF5 file (~30GB) and I need to shuffle the entries (along the 0 axis) in each dataset. Looking through the h5py docs I wasn't able to find either randomAccess or shuffle functionality, but I'm hoping that I've missed something.

Is anyone familiar enough with HDF5 to think of a fast way to random shuffle the data?

Here is pseudocode of what I would implement with my limited knowledge:

for dataset in datasets:
    unshuffled = list(range(dataset.shape[0]))
    while len(unshuffled) != 0:
        if len(unshuffled) <= 100:
            # too few rows left for two full blocks: swap the halves and stop
            half = len(unshuffled) // 2
            dataset[:half], dataset[half:] = dataset[half:], dataset[:half]
            break
        else:
            # pick two random 100-row blocks that have not been touched yet
            randomIndex1 = rand(len(unshuffled) - 100)
            randomIndex2 = rand(len(unshuffled) - 100)

            # mark both blocks as shuffled
            del unshuffled[randomIndex1:randomIndex1 + 100]
            del unshuffled[randomIndex2:randomIndex2 + 100]

            # swap the two 100-row blocks on disk
            dataset[randomIndex1:randomIndex1 + 100], dataset[randomIndex2:randomIndex2 + 100] = \
                dataset[randomIndex2:randomIndex2 + 100], dataset[randomIndex1:randomIndex1 + 100]
Aidan Gomez

  • I suspect that shuffling large volumes of data around a file (HDF5 or not) is always going to be slow. I'd be thinking of adding an extra dataset to use as an indirect index into the data. Every time you want to shuffle the data, shuffle the indirect index instead. – High Performance Mark Nov 24 '15 at 18:21
  • @HighPerformanceMark initially that was my solution, to just keep a shuffled array of indices; however, in my particular problem this is an issue because I need fast fetching, so I need to prefer being able to fetch a contiguous range instead of fetching element by element. That's why I'm thinking a post-processing script is going to be the only answer. – Aidan Gomez Nov 24 '15 at 18:26
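
For reference, here is a minimal sketch of the indirect-index idea from the comments; the file and dataset names ('data.h5', 'data', 'index') are purely illustrative. It leaves the data in place and shuffles only a small permutation dataset:

import h5py
import numpy as np

# Sketch with hypothetical names: shuffle a small permutation dataset
# instead of the data itself.
with h5py.File('data.h5', 'r+') as f:
    n = f['data'].shape[0]
    if 'index' not in f:
        f.create_dataset('index', data=np.arange(n))
    f['index'][...] = np.random.permutation(n)  # cheap: only the index moves

    # Reading a "shuffled" batch: h5py fancy indexing wants increasing,
    # unique indices, so sort them for the read and undo the sort after.
    batch_idx = f['index'][:100]
    order = np.argsort(batch_idx)
    batch = f['data'][batch_idx[order]][np.argsort(order)]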

2 Answers


You can use random.shuffle(dataset). This takes a little more than 11 minutes for a 30 GB dataset on my laptop with a Core i5 processor, 8 GB of RAM, and a 256 GB SSD. See the following:

>>> import os
>>> import random
>>> import time
>>> import h5py
>>> import numpy as np
>>>
>>> h5f = h5py.File('example.h5', 'w')
>>> h5f.create_dataset('example', (40000, 256, 256, 3), dtype='float32')
>>> # set all values of each instance equal to its index
... for i, instance in enumerate(h5f['example']):
...     h5f['example'][i, ...] = \
...             np.ones(instance.shape, dtype='float32') * i
...
>>> # get file size in bytes
... file_size = os.path.getsize('example.h5')
>>> print('Size of example.h5: {:.3f} GB'.format(file_size/2.0**30))
Size of example.h5: 29.297 GB
>>> def shuffle_time():
...     t1 = time.time()
...     random.shuffle(h5f['example'])
...     t2 = time.time()
...     print('Time to shuffle: {:.3f} seconds'.format(t2 - t1))
...
>>> print('Value of first 5 instances:\n{}'
...       ''.format(h5f['example'][:5, 0, 0, 0]))
Value of first 5 instances:
[ 0.  1.  2.  3.  4.]
>>> shuffle_time()
Time to shuffle: 673.848 seconds
>>> print('Value of first 5 instances after '
...       'shuffling:\n{}'.format(h5f['example'][:5, 0, 0, 0]))
Value of first 5 instances after shuffling:
[ 15733.  28530.   4234.  14869.  10267.]
>>> h5f.close()

Shuffling several smaller datasets that add up to the same total size should not perform worse than this.
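
For a file with several datasets, a loop along these lines (a sketch; the helper name and the 'r+' open mode are my assumptions, not part of the answer) applies the same call to each top-level dataset:

import random
import h5py

def shuffle_all(path):
    # Shuffle every top-level dataset in the file in place. Note that each
    # dataset gets its own, independent permutation here.
    with h5py.File(path, 'r+') as f:
        for name in f:
            obj = f[name]
            if isinstance(obj, h5py.Dataset):
                random.shuffle(obj)  # swaps rows along axis 0, two at a time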

forty_two

Here is my solution, shown below as notebook input/output cells:

input

import random

def shuffle(*datas):
    for d in datas:
        # re-seeding with the same value before every shuffle applies the
        # identical permutation to each sequence, keeping them aligned
        random.seed(666)
        random.shuffle(d)

a = list(range(6))
b = list(range(6))
c = list(range(6))
shuffle(a, b, c)
a, b, c

output

([2, 0, 1, 4, 5, 3], [2, 0, 1, 4, 5, 3], [2, 0, 1, 4, 5, 3])

input

import os
import h5py

os.chdir("/usr/local/dataset/flicker25k/")
# open read/write so the shuffle rewrites the datasets in place on disk
file = h5py.File("./FLICKR-25K.h5", "r+")
print(os.path.getsize("./FLICKR-25K.h5"))
images = file['images']
labels = file['LAll']
tags = file['YAll']
# same seed inside shuffle() => the same row order for all three datasets
shuffle(images, tags, labels)

output

executed in 27.9s, finished 22:49:53 2019-05-21
3320572656
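
As a quick sanity check (assuming the names from the cell above), shuffling a plain index list with the same seed reproduces the permutation that was applied to each dataset, which is why images, labels, and tags stay aligned:

import random

# The permutation depends only on the seed and the sequence length, so an
# index list of the same length gets shuffled into the same order.
idx = list(range(len(images)))
random.seed(666)
random.shuffle(idx)
# idx[k] is the original row that now sits at position k in every dataset
# (this relies on all three datasets having the same length).
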
wood west