
I have some pretty large arrays to deal with. By large, I mean on the scale of (514, 514, 374). I want to randomly pick an index based on its pixel value. For example, I need the 3-D index of a pixel with value equal to 1. So I list all the possibilities with

indices = np.asarray(np.where(img_arr == 1)).T

This works perfectly, except that it runs very slowly, to an intolerable extent, since the array is so big. So my question is: is there a better way to do this? It would be nicer if I could input a list of pixel values and get back a list of corresponding indices. For example, I want to sample the indices of the pixel values [0, 1, 2] and get back a list of indices [[1, 2, 3], [53, 215, 11], [223, 42, 113]].
Since I am working with medical images, solutions with SimpleITK are also welcome. So feel free to leave your opinions, thanks.
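For illustration, the single-value lookup above generalizes to a list of values by repeating it per value (a sketch only, using a small random array as a stand-in for the real volume; `np.argwhere(img_arr == value)` is the compact equivalent of `np.asarray(np.where(img_arr == value)).T`):

```python
import numpy as np

rng = np.random.default_rng(0)
img_arr = rng.integers(low=0, high=5, size=(10, 30, 20))  # small stand-in for the real image

# One random matching 3-D index per requested pixel value
sampled = []
for value in [0, 1, 2]:
    matches = np.argwhere(img_arr == value)           # all (z, y, x) indices with this value
    sampled.append(matches[rng.integers(len(matches))])
```

Each entry of `sampled` is then one index whose pixel has the corresponding value.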

yujuezhao
  • `np.argwhere(np.isin(a,[0,1,2]))`? – Divakar Sep 25 '19 at 08:40
  • What kind of values are in the array? Only integers? Do you have a rough idea of the distribution of the values? Like 10% are ones? You could first sample a subset of maybe 10,000 values randomly and then choose from this much smaller subset. – scleronomic Sep 25 '19 at 08:56
  • @scleronomic, hi. Only integers are involved. It is a mask actually, with probably 40% being ones. But I don't know how to get a subset and then choose from it; would you care to formulate that as an answer? I will accept it. – yujuezhao Sep 25 '19 at 09:00
  • @scleronomic I can sample a subset of indices, but I don't know how to use the subset of indices to sample one index based on its value. – yujuezhao Sep 25 '19 at 09:02
  • @Divakar It is a good point, though. But is there any difference from `np.where`? – yujuezhao Sep 25 '19 at 09:05
  • @yujuezhao `argwhere` simply combines the three tuples for the three dims into rows for a compact output. Just a convenience tool. – Divakar Sep 25 '19 at 09:06

1 Answer

import numpy as np
value = 1
# value_list = [1, 3, 5] you can also use a list of values -> *
n_samples = 3
n_subset = 500

# Create a example array
img_arr = np.random.randint(low=0, high=5, size=(10, 30, 20))

# Choose randomly indices for the array
idx_subset = np.array([np.random.randint(low=0, high=s, size=n_subset) for s in img_arr.shape]).T
# Get the values at the sampled positions
values_subset = img_arr[tuple(idx_subset[:, i] for i in range(img_arr.ndim))]
# Check which values match
idx_subset_matching_temp = np.where(values_subset == value)[0]
# idx_subset_matching_temp = np.argwhere(np.isin(values_subset, value_list)).ravel()  -> *
# Get all the indices of the subset with the correct value(s)
idx_subset_matching = idx_subset[idx_subset_matching_temp, :]  
# Shuffle the array of indices
np.random.shuffle(idx_subset_matching)  
# Only keep as much as you need
idx_subset_matching = idx_subset_matching[:n_samples, :]

This gives you the desired samples. The distribution of those samples should be the same as with your method of looking at all matches in the array: in both cases you get a uniform distribution over all positions with matching values.

You have to be careful when choosing the size of the subset and the number of samples you want. The subset must be large enough that it contains enough matches for the requested values, otherwise this won't work. A similar problem occurs if the values you want to sample are very sparse; then the subset needs to be very large (in the edge case, the whole array) and you gain nothing.
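As a rough sanity check on these sizes (a sketch; the 40 % fraction of ones is the estimate from the comments above, and the binomial tail is just one way to quantify "large enough"):

```python
from math import comb

n_subset = 500
n_samples = 3
p_match = 0.4  # estimated fraction of voxels equal to the target value (from the comments)

# Expected number of matches in the random subset
expected_matches = n_subset * p_match  # 200.0

# Probability that the subset contains at least n_samples matches (binomial tail)
p_enough = 1 - sum(comb(n_subset, k) * p_match ** k * (1 - p_match) ** (n_subset - k)
                   for k in range(n_samples))
```

With these numbers `p_enough` is essentially 1; for very sparse values (small `p_match`) it drops quickly, which is exactly the failure mode described above.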

If you are sampling often from the same array, it may also be a good idea to store the indices for each value

indices_i = np.asarray(np.where(img_arr == i)).T

and use those for your further computations.
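That pre-computation could look like this (a sketch, using a small random array and integer labels 0–4 in place of the real mask; `indices_by_value` is a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(0)
img_arr = rng.integers(low=0, high=5, size=(10, 30, 20))  # small stand-in for the real image

# One linear scan per value, done once up front
indices_by_value = {i: np.asarray(np.where(img_arr == i)).T for i in range(5)}

# Afterwards, drawing a random index for any value is cheap per draw
candidates = indices_by_value[1]
idx = candidates[rng.integers(len(candidates))]
```

The dictionary costs one pass over the array per value, but every later draw is just a random row lookup.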

scleronomic
  • You make a very compelling case here, but I benefit the most from your last suggestion, storing the indices, and that is exactly what I am about to do. – yujuezhao Sep 25 '19 at 12:03