
I have an HDF5 file that contains pictures of a certain number of people, from a certain number of source cameras, over many seconds. So it is structured like this:

file[second][person][camera].

But the data is quite irregular: for a given second there may be a different number of persons, and for a given second and person there may be pictures from a different set of cameras. I want to create a map-style PyTorch Dataset, so I need to implement __getitem__(idx) so that it returns a unique second, person and camera for that idx.

My first idea is to iterate through the whole dataset once and create dictionaries that can be accessed with idx, that is, second[idx] = this_second, person[idx] = this_person, camera[idx] = this_camera. Then I can use all of that to get a unique item from the dataset with:

 file[this_second][this_person][this_camera].

However, this solution seems too complicated to me. I wonder if there is a better way to solve this, since it is probably a common problem.
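For reference, the idea amounts to flattening the irregular nested structure into one list of (second, person, camera) triples, so that __getitem__ is a plain list lookup. A minimal sketch in plain Python (the toy nested dict is an assumption standing in for the HDF5 file):

```python
# Toy stand-in for the HDF5 file: second -> person -> camera -> image.
# Deliberately irregular: second 0 has 2 persons, second 1 has 1.
file = {
    0: {0: {0: "img000", 1: "img001"},
        1: {0: "img010"}},
    1: {0: {2: "img102"}},
}

# One pass over the irregular structure to collect every valid triple.
index = [(s, p, c)
         for s, persons in file.items()
         for p, cameras in persons.items()
         for c in cameras]

def get_item(idx):
    """Map a flat idx to a unique (second, person, camera) sample."""
    s, p, c = index[idx]
    return file[s][p][c]
```

With this, len(index) is the dataset length and get_item(idx) returns one unique picture per idx.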

Manveru

1 Answer


I agree, a dictionary is too complicated. Instead, create an array where the first axis is the item index and the second axis has 3 values for the associated second, person, and camera indices. If you plan to do this frequently, you can save the array as a dataset in the HDF5 file, then reuse that dataset.

Pseudo-code provided below:

import numpy as np
import h5py

# create array for the index values (no_idxs = total number of samples)
idx_arr = np.zeros((no_idxs, 3), dtype=int)
i_cnt = 0
# Loop over the irregular data:
for ...:
    # get this second, person, camera data
    # then add it to the index array
    idx_arr[i_cnt] = [this_second, this_person, this_camera]
    i_cnt += 1

# store the index array as a dataset in the same file
with h5py.File(your_hdf5_file, 'a') as h5f:
    h5f.create_dataset('indices', data=idx_arr)

with h5py.File(your_hdf5_file, 'r') as h5f:
    idx_ds = h5f['indices']
    img_ds = h5f['your_image_dataset_name']

    for row_arr in idx_ds:
        # use row_arr values to get the next second/person/camera image
        img = img_ds[row_arr[0], row_arr[1], row_arr[2]]
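To connect this back to the question: a map-style Dataset only needs __len__ and __getitem__, and the stored index array provides both directly. A hedged sketch below (in real code the class would subclass torch.utils.data.Dataset and read from the open h5py file; here a nested dict and a small NumPy array stand in so the example is self-contained):

```python
import numpy as np

class TripleIndexDataset:
    """Map-style dataset over an irregular (second, person, camera) layout.

    Assumption: `data` behaves like nested mappings second -> person ->
    camera -> image, and `idx_arr` is an (N, 3) array of index triples,
    as built in the pseudo-code above.
    """
    def __init__(self, data, idx_arr):
        self.data = data
        self.idx_arr = idx_arr

    def __len__(self):
        # one row per valid (second, person, camera) combination
        return len(self.idx_arr)

    def __getitem__(self, idx):
        s, p, c = self.idx_arr[idx]
        return self.data[s][p][c]

# Toy data: second 1 only has person 5, and only camera 2 for that person.
data = {0: {0: {0: "a", 1: "b"}}, 1: {5: {2: "c"}}}
idx_arr = np.array([[0, 0, 0], [0, 0, 1], [1, 5, 2]])
ds = TripleIndexDataset(data, idx_arr)
```

Because the irregularity is resolved once when idx_arr is built, __getitem__ stays O(1) and works unchanged with a DataLoader.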
kcw78