
I want to compare different areas of a 2-dimensional array $A$ with a given array $b$ of smaller size. Since I have to do this many times, it is crucial that it runs very fast. I have a solution that works, but I hoped it could be done more cleanly and faster.

In detail:

Let's say we have a big array and a small array. I loop over all possible 'patches' within the big array that have the same size as the small array and compare these patches with the given small array.

def get_best_fit(big_array, small_array):

    # we assume the small array is square
    patch_size = small_array.shape[0]
    min_value = np.inf
    for x in range(patch_size, big_array.shape[0] - patch_size):
        for y in range(patch_size, big_array.shape[1] - patch_size):
            p = get_patch_term(x, y, patch_size, big_array)
            tmp = some_metric(p, small_array)
            if min_value > tmp:
                min_value = tmp
                min_patch = p

    return min_patch, min_value

To get the patches I use this direct array-access implementation:

def get_patch_term(x, y, patch_size, data):
    """
    a patch has the size (patch_size)^^2
    """
    patch = data[(x - (patch_size-1)/2): (x + (patch_size-1)/2 + 1),
                 (y - (patch_size-1)/2): (y + (patch_size-1)/2 + 1)]
    return patch
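
For example, with the test array A from the timing section further below, a 3x3 patch centred on (7, 7) looks like this (just a sanity check; note that the slice is a view into A, nothing is copied):

    import numpy as np

    A = np.array(range(1200), dtype=np.float).reshape(30, 40)
    p = get_patch_term(7, 7, 3, A)
    print(p)
    # [[ 246.  247.  248.]
    #  [ 286.  287.  288.]
    #  [ 326.  327.  328.]]

    # the patch is a plain slice, i.e. a view into A
    print(np.may_share_memory(p, A))  # True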

I suspect this is the most time-critical part and that it could be done faster, but I am not sure.

I had a look at Cython, but maybe I did it wrong; I am not really familiar with it.

My first attempt was a direct translation into Cython:

def get_patch_term_fast(Py_ssize_t x_i, Py_ssize_t y_i,
                        Py_ssize_t patch_size,
                        np.ndarray[DTYPE_t, ndim=2] big_array):

    assert big_array.dtype == DTYPE
    patch_size = (patch_size - 1)/2

    cdef np.ndarray[DTYPE_t, ndim=2] patch = <np.ndarray[DTYPE_t, ndim=2]>big_array[(x_i - patch_size):(x_i + patch_size + 1), (y_i - patch_size): (y_i + patch_size + 1)]
    return patch
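
For reference, the Cython snippets here assume the usual module preamble at the top of the .pyx file; roughly this (a sketch; DTYPE is float64 here to match the test array below, the compiler directives are optional):

    # cython: boundscheck=False, wraparound=False
    import numpy as np
    cimport numpy as np
    from cython.parallel import parallel, prange

    DTYPE = np.float64
    ctypedef np.float64_t DTYPE_t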

This seems to be faster (see below), but I thought a parallel approach should be better still, so I came up with this:

def get_patch_term_fast_parallel(Py_ssize_t x_i, Py_ssize_t y_i,
                                 Py_ssize_t patch_size,
                                 np.ndarray[DTYPE_t, ndim=2] big_array):

    assert big_array.dtype == DTYPE
    patch_size = (patch_size - 1)/2

    cdef Py_ssize_t x
    cdef Py_ssize_t y


    cdef np.ndarray[DTYPE_t, ndim=1] patch = np.empty(np.power((2 * patch_size) + 1, 2))
    with nogil, parallel():
        for x in prange(x_i - patch_size, x_i + patch_size + 1):
            for y in prange(y_i - patch_size, y_i + patch_size + 1):
                patch[((x - (x_i - patch_size)) * (2 * patch_size + 1)) + (y - (y_i - patch_size))] = big_array[x, y]
    #cdef np.ndarray[DTYPE_t, ndim=2] patch = <np.ndarray[DTYPE_t, ndim=2]>big_array[(x_i - patch_size):(x_i + patch_size + 1), (y_i - patch_size): (y_i + patch_size + 1)]
    return patch
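
For completeness, the parallel version only actually runs in parallel if the extension is compiled with OpenMP; a setup.py along these lines would do it (a sketch; the module and file names are placeholders):

    # setup.py (sketch) -- build with: python setup.py build_ext --inplace
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize
    import numpy as np

    ext = Extension(
        "patch_tools",                    # placeholder module name
        ["patch_tools.pyx"],              # placeholder file name
        include_dirs=[np.get_include()],
        extra_compile_args=["-fopenmp"],  # required for prange/parallel
        extra_link_args=["-fopenmp"],
    )

    setup(ext_modules=cythonize([ext]))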

This is, unfortunately, not faster. For testing I used:

A = np.array(range(1200), dtype=np.float).reshape(30, 40)
b = np.array([41, 42, 81, 84]).reshape(2, 2)

x = 7
y = 7
print(timeit.timeit(lambda:get_patch_term_fast(x, y, 3, A), number=300))
print(timeit.timeit(lambda:get_patch_term_fast_parallel(x, y, 3, A).reshape(3,3), number=300))
print(timeit.timeit(lambda:get_patch_term(x, y, 3, A), number=300))

Which gives

0.0008792859734967351
0.0029909340664744377
0.0029337930027395487

So, my first question is: can this be done faster? The second question is: why is the parallel approach not faster than the original NumPy implementation?

Edit:

I tried to further parallelize the get_best_fit function, but unfortunately I get a lot of errors stating that I cannot assign a Python object without the GIL.

Here is the code:

def get_best_fit_fast(np.ndarray[DTYPE_t, ndim=2] big_array,
                      np.ndarray[DTYPE_t, ndim=2] small_array):

    # we assume the small array is square
    cdef Py_ssize_t patch_size = small_array.shape[0]

    cdef Py_ssize_t x
    cdef Py_ssize_t y

    cdef Py_ssize_t x_range = big_array.shape[0] - patch_size
    cdef Py_ssize_t y_range = big_array.shape[1] - patch_size

    cdef np.ndarray[DTYPE_t, ndim=2] p
    cdef np.ndarray[DTYPE_t, ndim=2] weights = np.empty((x_range - patch_size)*(y_range - patch_size)).reshape((x_range - patch_size), (y_range - patch_size))

    with nogil, parallel():
        for x in prange(patch_size, x_range):
            for y in prange(patch_size, y_range):
                p = get_patch_term_fast(x, y, patch_size, big_array)
                weights[x - patch_size, y - patch_size] = np.linalg.norm(np.abs(p - small_array))

    return np.min(weights)

PS: I omitted the part of returning the smallest patch...

  • Depending on `some_metric`, cross-correlation (or something similar) using fast Fourier transforms might well be faster. – DavidW Jun 16 '15 at 12:18
  • In any case, taking simple slices in numpy is really quite efficient (it doesn't even involve copying anything), so you're unlikely to beat it in Cython. You might have more luck applying Cython to the loops in `get_best_fit`. – DavidW Jun 16 '15 at 12:20
  • Unfortunately, a one-to-one translation of the get_best_fit function into Cython gives no speed advantage. And I can't get it to work in parallel. Even though it should work in theory, the assignment of objects within the parallel loop gives me trouble. – mijc Jun 16 '15 at 14:39
  • For the parallel loop you might try making `min_value` and `min_patch` numpy arrays with a length given by the number of threads, and then assigning to index `cython.parallel.threadid()` within the loop. (You'd then have to pick the best of those after the loop.) Unless your problem with assignment is with something else... – DavidW Jun 16 '15 at 14:46 (a sketch of this idea follows the comment list)
  • I think my problem goes further than that. I edited my question to explain it in more detail. And by the way, thank you very much for your help so far. – mijc Jun 16 '15 at 15:08
  • I think it's going to be difficult to make it work. You need all functions called to be `nogil`, and while you could probably make `get_patch_term_fast` nogil, you won't be able to change `np.linalg.norm`. My final suggestion is that you might find it easier to use the standard library threading or multiprocessing modules rather than Cython (you don't need to be as thorough about making everything `nogil`). However, I think I've really run out of useful advice here. – DavidW Jun 16 '15 at 15:51
  • Well, I didn't know that all functions that are called need to be `nogil`, and in reality I don't use `np.linalg.norm` but my own metric. So this might actually help a lot. I will have a look at it tomorrow. Thank you! – mijc Jun 16 '15 at 18:23
  • @DavidW BTW, `np.linalg.norm` is just a wrapper around the corresponding BLAS code, so you could call it in pure C with `nogil`. To make things easier, [Cython API for BLAS and LAPACK](http://comments.gmane.org/gmane.comp.python.cython.user/13033) will be available in the next version of scipy. But I agree that making all of it nogil would take some effort in this case and might not be worth it. – rth Jun 17 '15 at 00:24
  • Depending on the structure of your data, it could be faster to downsample the patch to a scalar (if square) or a vector, solve the problem on the equivalently downsampled big array, and then move to progressively less-downsampled versions of the problem that focus on the regions identified by the downsampling. Kind of like dynamic programming. But it depends on the downsampling not ruining the structure of your data. – Curt F. Jul 23 '15 at 18:30
  • I would suggest adapting your `get_best_fit_fast`, but instead of calling `get_patch_term` from within your innermost loop, replace this line with `p = big_array[(x -...` (it's a one-liner anyway). This will save you a bunch of function call overheads. – John Lyon Aug 03 '15 at 22:16
  • Can you explain what is in some_metric? If it is some common function, there could be another way to implement it. – Aug 07 '15 at 10:01
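
To illustrate DavidW's per-thread suggestion: a rough sketch of a fully nogil search, assuming some_metric is a plain sum of squared differences (inlined so no Python calls are needed inside the loop), indexing patches by their top-left corner rather than their centre, and compiled with OpenMP as above:

    # cython: boundscheck=False, wraparound=False
    # sketch: per-thread best value, SSD metric assumed
    import numpy as np
    cimport openmp
    from cython.parallel import parallel, prange, threadid

    def get_best_fit_nogil(double[:, :] big_array, double[:, :] small_array):
        cdef Py_ssize_t ps = small_array.shape[0]
        cdef Py_ssize_t x, y, i, j, tid
        cdef double d, tmp
        cdef int n_threads = openmp.omp_get_max_threads()

        # one best value / position per thread
        best_val_arr = np.full(n_threads, np.inf)
        best_x_arr = np.zeros(n_threads, dtype=np.intp)
        best_y_arr = np.zeros(n_threads, dtype=np.intp)
        cdef double[:] best_val = best_val_arr
        cdef Py_ssize_t[:] best_x = best_x_arr
        cdef Py_ssize_t[:] best_y = best_y_arr

        with nogil, parallel():
            tid = threadid()
            for x in prange(big_array.shape[0] - ps + 1):
                for y in range(big_array.shape[1] - ps + 1):
                    tmp = 0.0
                    for i in range(ps):
                        for j in range(ps):
                            d = big_array[x + i, y + j] - small_array[i, j]
                            tmp = tmp + d * d
                    if tmp < best_val[tid]:
                        best_val[tid] = tmp
                        best_x[tid] = x
                        best_y[tid] = y

        # pick the best of the per-thread results
        tid = np.argmin(best_val_arr)
        return (best_x[tid], best_y[tid]), best_val[tid]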

3 Answers


I think, depending on what your some_metric function does, it's possible there is already a highly optimized implementation out there. For example, if it is just a convolution, take a look at Theano, which will even let you leverage the GPU quite easily. Even if your function isn't quite as simple as a straightforward convolution, it is likely there will be building blocks within Theano you can use, rather than trying to go really low-level with Cython.
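
For illustration only (not part of the original answer): if some_metric happens to be a plain sum of squared differences, the whole search can be expressed with a couple of FFT-based convolutions. A sketch using SciPy's fftconvolve rather than Theano:

    import numpy as np
    from scipy.signal import fftconvolve

    def best_patch_ssd(big_array, small_array):
        # sum of squared values over every patch-sized window of big_array
        patch_sq = fftconvolve(big_array ** 2, np.ones_like(small_array),
                               mode='valid')
        # cross-correlation of big_array with small_array
        # (= convolution with the flipped kernel)
        cross = fftconvolve(big_array, small_array[::-1, ::-1], mode='valid')
        # SSD(p, b) = sum(p**2) - 2*sum(p*b) + sum(b**2) at every position
        ssd = patch_sq - 2 * cross + np.sum(small_array ** 2)
        x, y = np.unravel_index(np.argmin(ssd), ssd.shape)
        return (x, y), ssd[x, y]

Here (x, y) is the top-left corner of the best patch; the FFT route introduces a small floating-point error, which is usually negligible.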

– John Greenall

I initially posted another answer based on pattern matching (carried away by the title); here is a different approach.

Use one loop to iterate over both arrays (big and small) and apply a partial correlation metric (or whatever metric you use) without slicing arrays all the time:

As a side note, observe (depending on the metric, of course) that once one patch has been processed, the next patch (one row down or one column right) shares most of its values with the previous one; only its first and last rows (or columns) change. So the new score can often be computed faster from the previous one by subtracting the contribution of the row that drops out and adding the row that comes in (or vice versa). Since the metric function is not given, this is added as an observation (a concrete sketch of this rolling update follows the code below).

def get_best_fit(big_array, small_array):

    # we assume the small array is square
    patch_size = small_array.shape[0]
    x = 0 
    y = 0
    x2 = 0 
    y2 = 0
    W = big_array.shape[0]
    H = big_array.shape[1]
    W2 = patch_size
    H2 = patch_size
    min_value = np.inf
    tmp = 0
    min_patch = None
    start = 0 
    end = H*W
    start2 = 0
    end2 = W2*H2
    while start < end:

        tmp = 0
        start2 = 0
        x2 = 0 
        y2 = 0
        valid = True
        while start2 < end2:
            if x+x2 >= W or y+y2 >= H: 
                valid = False
                break
            # !!compute partial metric!!
            # no need to slice lists all the time
            tmp += some_metric_partial(x, y, x2, y2, big_array, small_array)
            x2 += 1 
            if x2>=W2: 
                x2 = 0 
                y2 += 1
            start2 += 1

        # one patch covered
        # try next patch
        if valid and min_value > tmp:
            min_value = tmp
            min_patch = [x, y, W2, H2]

        x += 1 
        if x>=W: 
            x = 0 
            y += 1

        start += 1

    return min_patch, min_value
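
For illustration only (not part of the answer's code): the rolling update works directly for the part of a metric that depends only on the patch itself; the cross term with the small array does not roll as simply, because its alignment with the small array changes at every shift. A sketch for a plain rolling patch sum (`rolling_patch_sums` is a hypothetical helper):

    import numpy as np

    def rolling_patch_sums(big_array, patch_size):
        # sum of values of every patch_size x patch_size patch, maintained
        # incrementally instead of re-summing each patch from scratch
        n_rows, n_cols = big_array.shape
        out = np.empty((n_rows - patch_size + 1, n_cols - patch_size + 1))
        # column sums over the first band of patch_size rows
        col = big_array[:patch_size].sum(axis=0)
        for x in range(n_rows - patch_size + 1):
            s = col[:patch_size].sum()              # first patch in this band
            out[x, 0] = s
            for y in range(1, n_cols - patch_size + 1):
                # slide right: drop the leftmost column, add the new one
                s += col[y + patch_size - 1] - col[y - 1]
                out[x, y] = s
            if x + patch_size < n_rows:
                # slide the band down one row: update the column sums
                col = col + big_array[x + patch_size] - big_array[x]
        return out
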
– Nikos M.

One issue with your time measurement of the parallel function is that you call reshape on the array after running the parallel function. It could be that the parallel function itself is faster, but the reshape adds extra time (although reshape should be quite fast, I'm not sure how fast).
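
For example, timing the parallel call with and without the reshape (same A, x, y and imports as in the question's timing snippet) would show how much of the gap the reshape accounts for:

    print(timeit.timeit(lambda: get_patch_term_fast_parallel(x, y, 3, A), number=300))
    print(timeit.timeit(lambda: get_patch_term_fast_parallel(x, y, 3, A).reshape(3, 3), number=300))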

The other issue is that we don't know what your some_metric term is, and it doesn't appear that you are using it in parallel. The way I see your parallel code working, you are getting the patches in parallel and then putting them in a queue to be analyzed by the some_metric function, which defeats the purpose of parallelizing your code.

As John Greenall said, using FFTs and convolutions can be quite fast. However, to take advantage of parallel processing, you would still need to do the analysis of the patch and small array in parallel as well.

How big are the arrays? Is memory an issue here as well?

– themantalope