Randomly sample sub-arrays from a 2D array in python

Question

Problem:

Let's say I have a 2D array from which I want to randomly sample (using Monte-Carlo) smaller 2D sub-arrays as shown by the black patches in the figure below. I am looking for an efficient method of doing this.

Prospective (but partial) solution:

I came across one function that partially achieves what I am trying to do after several hours of search, but it lacks the ability to sample a patch at a random location. At least I don't think it can sample from random locations based on its arguments, although it does have one random_state argument that I do not understand.

sklearn.feature_extraction.image.extract_patches_2d(image, patch_size, max_patches=None, random_state=None)

Question:

Select random patch coordinates (2D sub-array) and use them to slice a patch from the bigger array as shown in figure above. The randomly sampled patches are allowed to overlap.

See the solution under the question where it says `..lacks the ability to sample a patch at a random location`. — user11, Nov 19 '17 at 03:25
Question seems pretty clear, people are way too liberal with downvoting in my opinion. — Brad Solomon, Nov 19 '17 at 03:25
What's the desired distribution of patch sizes? (or, distributions for width and height) — realharry, Nov 19 '17 at 03:27
@realharry: The desired distribution is Monte Carlo sampling, which is essentially a random number from a uniform distribution. However, instead of just (0, 1), it would be `a + (b - a)*(0, 1)`. I don't want this to confuse the primary objective...distribution here is not that important right now. — user11, Nov 19 '17 at 03:28
A uniform distribution between some min and max? At least, like [0, max]? — realharry, Nov 19 '17 at 03:31
Generating random patch *coordinates/parameters* and using them to slice a numpy ndarray - would that be efficient? — wwii, Nov 19 '17 at 03:31
@wwii: Yes, that's what I want. I used `efficient` as a term to refer to the method/function/algorithm that can achieve that rather than a brute method, which would be using loops. — user11, Nov 19 '17 at 03:32
Somewhat related: [Generating random non-overlapping squares in a plane](https://stackoverflow.com/questions/46081491/how-to-generate-randomly-located-squares-of-equal-size-on-a-1x1-grid-that-have/46102304#46102304) — Brad Solomon, Nov 19 '17 at 16:30
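The idea the comments converge on (draw random patch coordinates, then slice) can be sketched as follows. This is only an illustration: `scaled_uniform` and `random_patch` are hypothetical names, and the size limits are assumed values. It uses the `a + (b - a) * U(0, 1)` scaling mentioned in the comments.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

def scaled_uniform(a, b):
    # a + (b - a) * U(0, 1): a uniform draw between a and b
    return a + (b - a) * rng.random()

def random_patch(arr, max_height, max_width):
    # Hypothetical helper: one randomly placed, randomly sized 2D patch.
    n_rows, n_cols = arr.shape
    height = int(scaled_uniform(1, max_height + 1))    # 1 .. max_height
    width = int(scaled_uniform(1, max_width + 1))      # 1 .. max_width
    top = int(scaled_uniform(0, n_rows - height + 1))  # keeps the patch inside arr
    left = int(scaled_uniform(0, n_cols - width + 1))
    return (top, left), arr[top:top + height, left:left + width]

image = np.arange(100 * 150).reshape(100, 150)
(top, left), patch = random_patch(image, max_height=4, max_width=6)
```

Basic slicing returns a view, so no data is copied. Calling `random_patch` repeatedly gives patches that may overlap.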

James · Accepted Answer · 2017-11-19T22:41:34.927

Here is a sampler that cuts a sample from an array of any dimensionality. It takes two functions: one controls where the cut starts and the other controls how wide the cut is along each axis.

Here is an explanation of the parameters:

arr - the input numpy array.
loc_sampler_fn - this is the function you want to use to set the corner of the box. If you want the corner of the box to be sampled uniformly from anywhere along the axis, use np.random.uniform. If you want the corner to be closer to the center of the array, use np.random.normal. However, we need to tell the function what range to sample over. This brings us to the next parameter.
loc_dim_param - this passes the size of each axis to loc_sampler_fn. If we are using np.random.uniform for the location sampler, we want to sample from the entire range of the axis. np.random.uniform has two parameters: low and high, so by passing the length of the axis to high it samples uniformly over the entire axis. In other words, if the axis has length 120 we want np.random.uniform(low=0, high=120), so we would set loc_dim_param='high'.
loc_params - this passes any additional parameters to loc_sampler_fn. Keeping with the example, we need to pass low=0 to np.random.uniform, so we pass the dictionary loc_params={'low':0}.

From here, it is basically identical for the shape of the box. If you want the box height and width to be uniformly sampled from 3 to 10, pass in shape_sampler_fn=np.random.uniform, with shape_dim_param=None since we are not using the size of the axis for anything, and shape_params={'low':3, 'high':11}.

import numpy as np

def box_sampler(arr,
                loc_sampler_fn, 
                loc_dim_param, 
                loc_params, 
                shape_sampler_fn, 
                shape_dim_param,
                shape_params):
    '''
    Extracts a sample cut from `arr`.

    Parameters:
    -----------
    loc_sampler_fn : function
        The function to determine where the minimum coordinate
        for each axis should be placed.
    loc_dim_param : string or None
        The parameter in `loc_sampler_fn` that should use the axes
        dimension size
    loc_params : dict
        Parameters to pass to `loc_sampler_fn`.
    shape_sampler_fn : function
        The function to determine the width of the sample cut 
        along each axis.
    shape_dim_param : string or None
        The parameter in `shape_sampler_fn` that should use the
        axes dimension size.
    shape_params : dict
        Parameters to pass to `shape_sampler_fn`.

    Returns:
    --------
    (slices, x) : A tuple of the slices used to cut the sample as well as
    the sampled subsection with the same dimensionality of arr.
        slice :: list of slice objects
        x :: array object with the same ndims as arr
    '''
    slices = []
    for dim in arr.shape:
        if loc_dim_param:
            loc_params.update({loc_dim_param: dim})
        if shape_dim_param:
            shape_params.update({shape_dim_param: dim})
        start = int(loc_sampler_fn(**loc_params))
        stop = start + int(shape_sampler_fn(**shape_params))
        slices.append(slice(start, stop))
    # Index with a tuple: recent NumPy versions reject a list of slices here.
    return slices, arr[tuple(slices)]

Example for a uniform cut on a 2D array with widths between 3 and 9:

a = np.random.randint(0, 1+1, size=(100,150))
box_sampler(a, 
            np.random.uniform, 'high', {'low':0}, 
            np.random.uniform, None, {'low':3, 'high':10})
# returns:
([slice(49, 55, None), slice(86, 89, None)], 
 array([[0, 0, 1],
        [0, 1, 1],
        [0, 0, 0],
        [0, 0, 1],
        [1, 1, 1],
        [1, 1, 0]]))

Examples for taking 2x2x2 chunks from a 10x20x30 3D array:

a = np.random.randint(0,2,size=(10,20,30))
box_sampler(a, np.random.uniform, 'high', {'low':0}, 
               np.random.uniform, None, {'low':2, 'high':2})
# returns:
([slice(7, 9, None), slice(9, 11, None), slice(19, 21, None)], 
 array([[[0, 1],
         [1, 0]],
        [[0, 1],
         [1, 1]]]))
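The explanation above mentions np.random.normal for corners closer to the center. One caveat: a raw normal draw can be negative or exceed the axis length. A possible way to handle that is to pass a small wrapper as `loc_sampler_fn`. In the sketch below, `centered_start` and its `spread` argument are hypothetical, and `box_sampler` is restated compactly (copying the parameter dicts instead of updating them in place) so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def box_sampler(arr, loc_sampler_fn, loc_dim_param, loc_params,
                shape_sampler_fn, shape_dim_param, shape_params):
    # Compact restatement of the sampler above, indexing with a tuple of slices.
    slices = []
    for dim in arr.shape:
        loc_kw = dict(loc_params)
        shape_kw = dict(shape_params)
        if loc_dim_param:
            loc_kw[loc_dim_param] = dim
        if shape_dim_param:
            shape_kw[shape_dim_param] = dim
        start = int(loc_sampler_fn(**loc_kw))
        stop = start + int(shape_sampler_fn(**shape_kw))
        slices.append(slice(start, stop))
    return slices, arr[tuple(slices)]

def centered_start(size, spread=6.0):
    # Hypothetical helper: normal draw centred on the axis midpoint,
    # clipped so the start index is always valid.
    draw = rng.normal(loc=size / 2, scale=size / spread)
    return int(np.clip(draw, 0, size - 1))

a = rng.integers(0, 2, size=(100, 150))
slices, patch = box_sampler(a, centered_start, 'size', {'spread': 6.0},
                            rng.uniform, None, {'low': 3, 'high': 10})
```

Here `loc_dim_param='size'` routes each axis length into the wrapper, which is the same mechanism as `'high'` for np.random.uniform.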

Update based on the comments.

For your specific purpose, it looks like you want a rectangular sample where the starting corner is uniformly sampled from anywhere in the array, and the width of the sample along each axis is uniformly sampled, but can be limited.

Here is a function that generates these samples. min_width and max_width can accept iterables of integers (such as a tuple) or a single integer.

def uniform_box_sampler(arr, min_width, max_width):
    '''
    Extracts a sample cut from `arr`.

    Parameters:
    -----------
    arr : array
        The numpy array to sample a box from
    min_width : int or tuple
        The minimum width of the box along a given axis.
        If a tuple of integers is supplied, it must have the
        same length as the number of dimensions of `arr`
    max_width : int or tuple
        The maximum width of the box along a given axis.
        If a tuple of integers is supplied, it must have the
        same length as the number of dimensions of `arr`

    Returns:
    --------
    (slices, x) : A tuple of the slices used to cut the sample as well as
    the sampled subsection with the same dimensionality of arr.
        slice :: list of slice objects
        x :: array object with the same ndims as arr
    '''
    if isinstance(min_width, (tuple, list)):
        assert len(min_width)==arr.ndim, 'Dimensions of `min_width` and `arr` must match'
    else:
        min_width = (min_width,)*arr.ndim
    if isinstance(max_width, (tuple, list)):
        assert len(max_width)==arr.ndim, 'Dimensions of `max_width` and `arr` must match'
    else:
        max_width = (max_width,)*arr.ndim

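    slices = []
    for dim, mn, mx in zip(arr.shape, min_width, max_width):
        start = int(np.random.uniform(0, dim))
        # mx + 1 so that, after truncation, widths up to mx are possible
        stop = start + int(np.random.uniform(mn, mx + 1))
        slices.append(slice(start, stop))
    # Index with a tuple: recent NumPy versions reject a list of slices here.
    return slices, arr[tuple(slices)]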

Here is an example of a box cut that starts uniformly anywhere in the array, with a height drawn uniformly from 1 to 4 and a width drawn uniformly from 2 to 6 (just to show). In this case, the box is 3 by 4, starting at the 66th row and 19th column.

x = np.random.randint(0,2,size=(100,100))
uniform_box_sampler(x, (1,2), (4,6))
# returns:
([slice(65, 68, None), slice(18, 22, None)], 
 array([[1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 1, 1, 0]]))
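Because the start can land near the far edge of an axis, the slice may come out narrower than the width that was drawn. If every box should lie fully inside the array, one option is to draw the width first and then restrict the start accordingly. The sketch below assumes integer widths with `min_width` no larger than the axis length. `uniform_box_inside` is a hypothetical name, not part of the answer above.

```python
import numpy as np

def uniform_box_inside(arr, min_width, max_width, rng=None):
    # Hypothetical variant: every box lies fully inside `arr`.
    rng = np.random.default_rng() if rng is None else rng
    if np.isscalar(min_width):
        min_width = (min_width,) * arr.ndim
    if np.isscalar(max_width):
        max_width = (max_width,) * arr.ndim
    slices = []
    for dim, mn, mx in zip(arr.shape, min_width, max_width):
        # Integer width in [mn, mx], capped at the axis length.
        width = int(rng.integers(mn, min(mx, dim) + 1))
        # Start chosen so that start + width <= dim.
        start = int(rng.integers(0, dim - width + 1))
        slices.append(slice(start, start + width))
    slices = tuple(slices)
    return slices, arr[slices]

x = np.zeros((100, 100), dtype=int)
slices, box = uniform_box_inside(x, (1, 2), (4, 6), rng=np.random.default_rng(0))
```

The trade-off is that start positions are no longer uniform over the whole axis: the last `width - 1` indices can never be a starting corner.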

It seems to be doing what I want, but I am not clear with the arguments of the function after `arr`. I tried looking at the examples and connect them with the argument definitions, but I couldn't understand. Could you please help clarify those arguments or try to connect them with your example? — user11, Nov 19 '17 at 04:07
Very nice. I can understand it now. What about the `shape_*` arguments? Do you mind adding their explanation as well? Thanks. — user11, Nov 19 '17 at 18:09
Can you describe how to sample the boxes based on their width/height? — user11, Nov 19 '17 at 21:55
Are you asking how to have the height or width scale with the size of the axis? — James, Nov 19 '17 at 21:58
Yes, I don't want to sample a very big box (about the size of the global array)...so, I want to constrain the size of the sampled boxes. I believe `shape_*` argument takes care of that, but I don't know how to use it. I would appreciate if you can also explain its usage in the example. Let's say I want to constrain the sampled box to maximum 4 cells along rows and maximum 6 cells along columns of the main array `arr`. — user11, Nov 19 '17 at 22:05
See the additional information in the answer. For your particular case, the function is much simpler since you only want uniform sampling of the array and the width/height of the box. — James, Nov 19 '17 at 22:42
Thanks. The arguments for the function with non-uniform sampling was not easy to understand. — user11, Nov 19 '17 at 23:15

Brad Solomon · Answer 2 · 2017-11-19T20:31:04.447

So it seems like your issue with sklearn.feature_extraction.image.extract_patches_2d is that it forces you to specify a single patch size, whereas you are looking for different patches of random size.

One thing to note here is that your result can't be a NumPy array (unlike the result of the sklearn function) because arrays have to have uniform-length rows/columns. So your output needs to be some other data structure that contains differently-shaped arrays.

Here's a workaround:

from itertools import product

import numpy as np

def random_patches_2d(arr, n_patches):
    # All possible row and column slices from `arr`, given its shape
    row, col = arr.shape
    row_comb = [(i, j) for i, j in product(range(row), range(row)) if i < j]
    col_comb = [(i, j) for i, j in product(range(col), range(col)) if i < j]

    # Pick randomly from the possible slices.  The distribution will be
    #     random uniform over the given slices.  We can't call
    #     np.random.choice on the lists of pairs directly because it only
    #     samples from a 1d array, so we sample indices into them instead.
    a = np.random.choice(np.arange(len(row_comb)), size=n_patches)
    b = np.random.choice(np.arange(len(col_comb)), size=n_patches)
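    for i, j in zip(a, b):
        # `i` indexes the row pairs, `j` indexes the column pairs
        yield arr[row_comb[i][0]:row_comb[i][1],
                  col_comb[j][0]:col_comb[j][1]]

Example:

np.random.seed(99)
arr = np.arange(49).reshape(7, 7)
res = list(random_patches_2d(arr, 5))
print(res[0])
print()
print(res[3])

Each element of `res` is a rectangular slice of `arr`. The row and column bounds are drawn independently, so the patches generally differ in both height and width.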

Condensed:

def random_patches_2d(arr, n_patches):
    row, col = arr.shape
    row_comb = [(i, j) for i, j in product(range(row), range(row)) if i < j]
    col_comb = [(i, j) for i, j in product(range(col), range(col)) if i < j]
    a = np.random.choice(np.arange(len(row_comb)), size=n_patches)
    b = np.random.choice(np.arange(len(col_comb)), size=n_patches)
    for i, j in zip(a, b):
        yield arr[row_comb[i][0]:row_comb[i][1],
                  col_comb[j][0]:col_comb[j][1]]
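Listing every (i, j) pair grows quadratically with the axis length, which can get heavy for large images. A possible alternative, sketched here under the assumption that any start/stop pair should be equally likely, is to draw two distinct cut points per axis and sort them. `random_patches_direct` is a hypothetical name.

```python
import numpy as np

def random_patches_direct(arr, n_patches, rng=None):
    # Hypothetical alternative: no list of all (i, j) pairs is built.
    rng = np.random.default_rng() if rng is None else rng
    n_rows, n_cols = arr.shape
    for _ in range(n_patches):
        # Two distinct cut points per axis, sorted so that start < stop.
        r0, r1 = np.sort(rng.choice(n_rows + 1, size=2, replace=False))
        c0, c1 = np.sort(rng.choice(n_cols + 1, size=2, replace=False))
        yield arr[r0:r1, c0:c1]

arr = np.arange(49).reshape(7, 7)
patches = list(random_patches_direct(arr, 5, rng=np.random.default_rng(99)))
```

In this sketch the cut points run from 0 to the axis length inclusive, so a patch can extend to the last row or column.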

Addressing your comment: you could successively add 1 patch and check the area after each.

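# `size` is just row x col
area = arr.size
patch_area = 0
patches = []
while patch_area <= area:  # or while patch_area <= 0.1 * area:
    # The function is a generator, so take its single patch with next()
    patch = next(random_patches_2d(arr, n_patches=1))
    patches.append(patch)
    patch_area += patch.size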
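Adding up patch areas counts cells that lie under several overlapping patches more than once. If the budget is meant to apply to the distinct cells covered, a boolean mask is one way to track it. The sketch below assumes a 10% budget and patches of at most 4 rows by 6 columns (the limits mentioned in the comments), and draws its own coordinates so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = np.arange(100 * 100).reshape(100, 100)

covered = np.zeros(arr.shape, dtype=bool)  # True where some patch already lies
patches = []
max_fraction = 0.10  # assumed budget: 10% of the cells in `arr`

while covered.mean() <= max_fraction:
    # Assumed size limits: at most 4 rows and 6 columns per patch.
    h = int(rng.integers(1, 5))
    w = int(rng.integers(1, 7))
    top = int(rng.integers(0, arr.shape[0] - h + 1))
    left = int(rng.integers(0, arr.shape[1] - w + 1))
    patches.append(arr[top:top + h, left:left + w])
    covered[top:top + h, left:left + w] = True
```

`covered.mean()` is the fraction of cells touched by at least one patch, so overlap does not inflate the total. The loop stops right after the patch that pushes coverage past the budget.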

Thanks for providing your solution. I think it is helpful, but I wanted to sample one box at a time (which I should have explained in the question..sorry for missing that) because after sampling a certain number of patches I want to check if the area of those patches has not exceeded certain value, say 10 % of the total area of the main global array. Also, thanks for giving a link to your another answer to a different, but related, question. — user11, Nov 19 '17 at 18:13
If you're interested, I updated the answer with one more code snippet to reflect what you're trying to do — Brad Solomon, Nov 19 '17 at 20:31
Also, if you have a very big image, you should look into `numpy.lib.stride_tricks.as_strided`. It's very efficient for forming different "windows" over an array but a little more complicated than basic indexing. — Brad Solomon, Nov 19 '17 at 20:33
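For the strided-window idea in the last comment, NumPy 1.20 and later also provide `sliding_window_view`, a safer wrapper around the same mechanism. The sketch below only covers the fixed-size case: it assumes every patch is 4 by 6, which is not what the question asks for in general.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

rng = np.random.default_rng(0)
image = np.arange(100 * 150).reshape(100, 150)

# View of every 4x6 window; no pixel data is copied at this point.
windows = sliding_window_view(image, (4, 6))  # shape (97, 145, 4, 6)

# Pick 10 window positions at random (overlap allowed).
rows = rng.integers(0, windows.shape[0], size=10)
cols = rng.integers(0, windows.shape[1], size=10)
patches = windows[rows, cols]  # shape (10, 4, 6); fancy indexing makes a copy
```

All patches are gathered in one indexing call, which avoids a Python loop when many equally sized patches are needed.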
