I need to perform a convolution on an image in place; by "in place" I mean that as the structuring element is applied to successive pixels, the results of the previous steps overwrite the image. To put it in context, this is useful in Gauss-Seidel iteration.
I'm currently using scipy.ndimage.convolve1d, which clearly doesn't do in-place convolution as I described. I know how to write a kernel that does this using numba (basically a for-loop where you overwrite the existing elements as you iterate over pixels), but I was wondering if I could get further speedups on a GPU. The issue is that since the array being operated on changes at each iteration, it's not trivial to code on a GPU because of race conditions.
Here's a concrete example:
a = [1, 5, 3, 0, -4, 1]
weights = [1, 2, 1]
Here's what scipy.ndimage.convolve1d does (assume out is the result, and assume 0 for values beyond the boundaries):
# Step 1: 1*0 + 2*1 + 1*5 = 7 -> out[0], a = [1, 5, 3, 0, -4, 1]
# Step 2: 1*1 + 2*5 + 1*3 = 14 -> out[1], a = [1, 5, 3, 0, -4, 1]
# Step 3: 1*5 + 2*3 + 1*0 = 11 -> out[2], a = [1, 5, 3, 0, -4, 1]
# ...
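For reference, a minimal sketch of the standard call that should reproduce the steps above (mode='constant' with cval=0.0 gives the zero boundary assumed here):

import numpy as np
from scipy.ndimage import convolve1d

a = np.array([1, 5, 3, 0, -4, 1])
weights = np.array([1, 2, 1])

out = convolve1d(a, weights, mode='constant', cval=0.0)
# out -> [ 7 14 11 -1 -7 -2]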
Here's what I want:
# Step 1: 1*0 + 2*1 + 1*5 = 7 -> a[0], a = [7, 5 , 3 , 0, -4, 1]
# Step 2: 1*7 + 2*5 + 1*3 = 20 -> a[1], a = [7, 20, 3 , 0, -4, 1]
# Step 3: 1*20 + 2*3 + 1*0 = 26 -> a[2], a = [7, 20, 26, 0, -4, 1]
# ...
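A minimal 1D sketch of that in-place sweep in numba (the function name and the explicit zero boundary handling are my own; with a symmetric kernel the convolution/correlation flipping convention doesn't matter):

import numba
import numpy as np

@numba.njit()
def inplace_conv1d(a, w):
    # Serial left-to-right sweep: a[i-1] already holds its updated value
    # when a[i] is computed, which gives the Gauss-Seidel-style behaviour above.
    n = a.shape[0]
    for i in range(n):
        left = a[i - 1] if i > 0 else 0.0       # zero padding on the left
        right = a[i + 1] if i < n - 1 else 0.0  # zero padding on the right
        a[i] = w[0] * left + w[1] * a[i] + w[2] * right

a = np.array([1.0, 5.0, 3.0, 0.0, -4.0, 1.0])
inplace_conv1d(a, np.array([1.0, 2.0, 1.0]))
# a -> [7., 20., 26., ...] matching the steps above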
Here's my numba implementation for a 2D convolution with the following weights:
w = [[0,  1, 0],
     [1, -4, 1],
     [0,  1, 0]]
import numba

@numba.njit()
def myconv(u):
    # In-place 5-point Laplacian stencil: the already-updated values of
    # u[i-1, j] and u[i, j-1] are used when u[i, j] is computed (Gauss-Seidel sweep).
    Nx, Ny = u.shape
    for i in range(1, Nx - 1):
        for j in range(1, Ny - 1):
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j]
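And a hypothetical call (the array values are made up) just to show the usage; the boundary rows and columns are left untouched by the loop above:

import numpy as np

u = np.arange(25, dtype=np.float64).reshape(5, 5)
myconv(u)  # u is modified in place
print(u)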