
I have an HDF5 dataset which I read as a numpy array:

my_file = h5py.File(h5filename, 'r')
file_image = my_file['/image']

and a list of indices called in.

I want to split the image dataset into two separate np.arrays: one containing images corresponding to the in indices, and the other containing images whose indices are not in in. The order of images is very important - I want to split the dataset such that the original order of the images is preserved within each subset. How can I achieve this?

I tried the following:

labeled_image_dataset = list(file_image[in])

However, h5py gave me an error saying that the indices must be in increasing order. I can't change the order of indices, since I need the images to stay in their original order.

My code:

my_file = h5py.File(h5filename, 'r')
file_image = my_file['/imagePatches']
li = dataframe["label"]
temporary_list = file_image

# select images whose indices exist in "in"
labeled_dataset = list(temporary_list[in])

# select images whose indices don't exist in "in"
unlabeled_dataset = np.delete(temporary_list, in, 0)
ali_m
ga97rasl

2 Answers


I'm not following why or how you are choosing the indexes, but as the error indicates, when you index a dataset in the h5 file, the indexes have to be sorted. Remember that files are serial storage, so it is easier and faster to read straight through than to go back and forth. Regardless of where the constraint lies, in h5py or the h5 back end, idx has to be ordered.

But if you load the whole array (or some contiguous chunk) into memory, which may require a copy, then you can use an unsorted, or even repetitious, idx list.

In other words, h5py arrays can be indexed like numpy arrays, but with some limits.

http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
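
To make the "load it into memory first" route concrete, here is a minimal sketch. It assumes the whole dataset fits in RAM; split_in_memory is a hypothetical helper name, and a plain NumPy array stands in for the h5py dataset so the snippet runs on its own.

```python
import numpy as np

def split_in_memory(dataset, idx):
    """Split `dataset` along axis 0 into (selected, remaining).

    Assumption: `dataset` is an h5py Dataset or a NumPy array small
    enough to be held in memory in full.
    """
    # For an h5py Dataset, `[...]` reads everything into a NumPy array.
    data = dataset[...]
    idx = np.asarray(idx)
    # In-memory fancy indexing accepts any order and repeated indices.
    selected = data[idx]
    # Rows not listed in idx, kept in their original order.
    remaining = np.delete(data, idx, axis=0)
    return selected, remaining

# Stand-in for file_image: 10 "images" of shape (2, 2).
images = np.arange(10 * 2 * 2).reshape(10, 2, 2)
idx = [5, 1, 9, 0, 3]

labeled, unlabeled = split_in_memory(images, idx)
print(labeled.shape, unlabeled.shape)
```

With a real file you would pass my_file['/imagePatches'] instead of the stand-in array. If the labeled images should follow the dataset's order rather than the order of idx, sorting idx first (np.sort(idx)) avoids the problem entirely.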

hpaulj

Using in as a variable name will not work, since in is a reserved Python keyword (e.g. see the list comprehensions below). For clarity I've renamed it as idx.

One straightforward solution would be to simply iterate over your set of indices in a standard Python for loop or list comprehension:

labelled_dataset = [file_image[ii] for ii in idx]
unlabelled_dataset = [file_image[ii] for ii in range(len(file_image)) 
                      if ii not in idx]

It might be faster to use vectorized indexing, e.g. if you have a very large number of small image patches to load. In that case you could use np.argsort to find the permutation that sorts idx in ascending order, then index into file_image with the sorted indices. Afterwards you can "undo" the effect of sorting idx by indexing into the resulting array with the inverse of that permutation, which np.argsort(order) gives you.

Here's a simplified example to illustrate:

import numpy as np

target = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
idx = np.array([5, 1, 9, 0, 3])

# find a set of indices that will sort `idx` in ascending order
order = np.argsort(idx)
ascending_idx = idx[order]

# use the sorted indices to index into the target array
ascending_result = target[ascending_idx]

# "undo" the effect of sorting `idx` by indexing the result with the
# inverse permutation of `order` (indexing with `order` itself is only
# correct when `order` is its own inverse, as it happens to be for this idx)
inverse = np.argsort(order)
result = ascending_result[inverse]

print(np.all(result == target[idx]))
# True
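
The same idea can be wrapped up for both halves of the split. This is only a sketch: read_in_given_order and read_complement are hypothetical helper names, and it relies on the assumption that the dataset accepts an increasing, duplicate-free integer array as an index (which is what the h5py error message asks for). np.unique returns the sorted unique indices together with the inverse mapping, so repeated indices are handled too. A NumPy array stands in for file_image here.

```python
import numpy as np

def read_in_given_order(dataset, idx):
    """Read rows `idx` with one sorted read, then restore the order of `idx`."""
    idx = np.asarray(idx)
    # unique_idx is increasing and duplicate-free;
    # unique_idx[inverse] reconstructs the original idx.
    unique_idx, inverse = np.unique(idx, return_inverse=True)
    block = dataset[unique_idx]   # single read with sorted indices
    return block[inverse]         # in-memory reorder back to idx order

def read_complement(dataset, idx):
    """Read every row whose index is NOT in `idx`, in original order."""
    keep = np.ones(len(dataset), dtype=bool)
    keep[np.asarray(idx)] = False
    # flatnonzero yields the remaining indices in increasing order.
    return dataset[np.flatnonzero(keep)]

# Stand-in for file_image: 10 "images" of shape (2, 2).
images = np.arange(10 * 2 * 2).reshape(10, 2, 2)
idx = [5, 1, 9, 1, 0]             # unsorted, with a repeated index

labeled = read_in_given_order(images, idx)
unlabeled = read_complement(images, idx)
print(labeled.shape, unlabeled.shape)
```

Both helpers only ever hand sorted indices to the dataset, so the reordering happens on the in-memory result rather than on the file.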
ali_m