
I need to perform a convolution on an image in-place, and by in-place I mean that as the structuring element is applied to successive pixels, the result of the previous steps overwrites the image. To put it in context, this is useful in Gauss-Seidel iteration.

I'm currently using scipy.ndimage.convolve1d, which clearly doesn't do in-place convolution as described above. I know how to write a kernel that does it using numba (basically a for-loop where you overwrite the existing elements as you iterate over pixels), but I was wondering if I could get further speedups on a GPU. The issue is that since the array being operated on changes at each step, it's not trivial to code on a GPU because of race conditions.

Here's a concrete example:

a = [1, 5, 3, 0, -4, 1]
weights = [1, 2, 1]

Here's what scipy.ndimage.convolve1d does (assume `out` is the result, and assume 0 for values beyond the boundaries):

# Step 1: 1*0 + 2*1 + 1*5 = 7  -> out[0], a = [1, 5, 3, 0, -4, 1]
# Step 2: 1*1 + 2*5 + 1*3 = 14 -> out[1], a = [1, 5, 3, 0, -4, 1]
# Step 3: 1*5 + 2*3 + 1*0 = 12 -> out[2], a = [1, 5, 3, 0, -4, 1]
# ...

Here's what I want:

# Step 1: 1*0 + 2*1 + 1*5 = 7   -> a[0], a = [7, 5 , 3 , 0, -4, 1]
# Step 2: 1*7 + 2*5 + 1*3 = 20  -> a[1], a = [7, 20, 3 , 0, -4, 1]
# Step 3: 1*20 + 2*3 + 1*0 = 26 -> a[2], a = [7, 20, 26, 0, -4, 1]
# ...
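
For reference, the desired behaviour can be written as a plain Python loop (a sketch; `inplace_convolve1d` is a hypothetical name, with values outside the array treated as 0):

```python
def inplace_convolve1d(a, weights):
    # Overwrite a[i] immediately, so later steps read the already-updated
    # values (Gauss-Seidel style). Out-of-range neighbours count as 0.
    n, k = len(a), len(weights)
    half = k // 2
    for i in range(n):
        s = 0
        for j in range(k):
            idx = i + j - half
            s += weights[j] * (a[idx] if 0 <= idx < n else 0)
        a[i] = s
    return a

print(inplace_convolve1d([1, 5, 3, 0, -4, 1], [1, 2, 1]))
# [7, 20, 26, 22, 15, 17]
```

The first three outputs match the steps above (7, 20, 26), and each one feeds into the next.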

Here's my numba implementation for a 2D convolution with the following weights:

w = [[0,  1, 0],
     [1, -4, 1],
     [0,  1, 0]]

import numba

@numba.njit()
def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j]
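
To make the difference concrete, here is a NumPy-only comparison between the in-place sweep and an ordinary (scipy-style) convolution with the same stencil; `sweep_inplace` and `sweep_outofplace` are names used for illustration:

```python
import numpy as np

def sweep_inplace(u):
    # same update as the numba kernel above, without the decorator
    Nx, Ny = u.shape
    for i in range(1, Nx - 1):
        for j in range(1, Ny - 1):
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j]

def sweep_outofplace(u):
    # ordinary convolution: every read comes from the original array
    out = u.copy()
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1]
                       + u[1:-1, :-2] + u[1:-1, 2:] - 4*u[1:-1, 1:-1])
    return out

u = np.arange(16.0).reshape(4, 4)
a = u.copy()
sweep_inplace(a)
b = sweep_outofplace(u)
print(np.array_equal(a, b))  # False: the sweep reads already-updated neighbours
```

The two results diverge from the second interior pixel onward, which is exactly the dependency that makes a naive GPU port race-prone.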
Blademaster
  • Why can't you overwrite the image with the result of the convolution? – mozway Sep 25 '22 at 18:54
  • Well, I can, it's just that I can overwrite the image once the convolution is finished, but that's not what I want. I want the image to be overwritten "during" the convolution, meaning that as the convolution gets applied to each pixel, previous pixels are already changed so that the future steps are performed on an altered version of the image. With `scipy.ndimage.convolve1d`, the convolution is applied on the original image all along. – Blademaster Sep 25 '22 at 19:38
  • That kind of iterative convolution has to be done with an explicit loop - in Python, or with the aid of a compiling tool like `numba`. The tools that `numpy` and `scipy` provide don't work that way. – hpaulj Sep 26 '22 at 04:52
  • @hpaulj Yeah, I already implemented one with `numba` and it works fine, it's just that I wish there was a GPU implementation for even more speedups. – Blademaster Sep 27 '22 at 02:19
  • Why didn't you mention numba and gpu in the first place? You are much further along on this problem than almost anyone else (especially any who might be hanging around SO during the narrow 24 hr window when your question is most visible.) – hpaulj Sep 27 '22 at 07:18
  • @hpaulj You're right, I just made further edits to the post and the title. – Blademaster Sep 27 '22 at 09:46
  • I'd suggest deleting this, and posting a new question. Give some more thought to the tags. – hpaulj Sep 27 '22 at 12:56
  • That sounds like a good idea, will do asap – Blademaster Sep 27 '22 at 14:52

1 Answer


If I understood correctly, you are trying to run a 2D infinite impulse response (IIR) filter.

If your filter is separable, you can do that by applying `scipy.signal.lfilter` once along each axis. Note that this method shifts the output, so it may be preferable to use `filtfilt`.
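
For instance, the 1D recursion in the question (weights [1, 2, 1] with the left neighbour already updated, i.e. y[n] = y[n-1] + 2*x[n] + x[n+1]) can be sketched with `lfilter` by splitting the stencil into a feed-forward part and a feedback part:

```python
import numpy as np
from scipy.signal import lfilter

x = np.array([1., 5., 3., 0., -4., 1.])
# feed-forward part: z[n] = 2*x[n] + x[n+1]  (x[n+1] -> 0 past the end)
z = 2*x + np.append(x[1:], 0.0)
# feedback part: y[n] = z[n] + y[n-1], i.e. b = [1], a = [1, -1]
y = lfilter([1.0], [1.0, -1.0], z)
print(y)  # [ 7. 20. 26. 22. 15. 17.]
```

This reproduces the in-place result from the question's 1D example.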

Edit

The example given by the OP is not separable to begin with; however, it is simple enough that the computation can be done efficiently using a cumulative sum.

This is the original implementation

def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j])

Notice that in the j loop the only already-modified value in the sum is u[i, j-1]; adding it to u[i, j] can be deferred and performed in a second loop without changing the output.

def myconv_worse(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j+1] - 4*u[i, j])
        for j in range(1, Ny-1):
            u[i, j] += u[i,j-1]
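
A quick sanity check on random data (copying both kernels verbatim) confirms the deferred version produces the same result:

```python
import numpy as np

def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j])

def myconv_worse(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j+1] - 4*u[i, j])
        for j in range(1, Ny-1):
            u[i, j] += u[i, j-1]

rng = np.random.default_rng(0)
u = rng.standard_normal((6, 6))
a, b = u.copy(), u.copy()
myconv(a)
myconv_worse(b)
print(np.allclose(a, b))  # True
```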

Then, notice that the second loop can be computed with a cumulative sum:

def myconv_vec(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        u[i, 1:-1] = u[i-1, 1:-1] + u[i+1, 1:-1] + u[i, 2:] - 4 * u[i, 1:-1]
        # the deferred u[i, j-1] additions collapse into a cumulative sum;
        # passing the axis explicitly keeps this working on torch tensors too
        u[i, :-1] = u[i, :-1].cumsum(0)

This vectorizes one axis.

This is fully compatible with frameworks that support the GPU out of the box, such as PyTorch:

import torch
u = torch.randn(10,10, device='cuda')
t1 = u.clone()
t2 = u.clone()
myconv_vec(t1)
myconv(t2)
print(t1 - t2)
Bob
  • I just edited my post and added a concrete example, is this what `lfilter` does? Thanks! – Blademaster Sep 26 '22 at 22:32
  • No it is not, and your filter is also not even separable – Bob Sep 27 '22 at 13:53
  • Now I added a custom code for the specific example, just to have an acceptable answer. – Bob Sep 27 '22 at 19:00
  • Thank you so much. One thing though: the output from `myconv_mat` doesn't seem to match with that of `myconv`. I'd really appreciate it if you could double check. – Blademaster Sep 28 '22 at 05:23
  • Ohh you are right, the reasoning doesn't extend to the second axis, why? Could you try to find what is different in this case? – Bob Sep 28 '22 at 08:29
  • I spent some time trying to figure it out, but to no avail... Anyway, really appreciate your efforts in helping me out. – Blademaster Sep 29 '22 at 00:11