vectorized way to multiply and add specific axes in numpy array (convolutional layer backprop)

Question

I was wondering how it would be possible to vectorize the following quadruple for-loops (this is t do with backprop in a convolutional layer).

W = np.ones((2, 2, 3, 8)) # just a toy example
dW = np.zeros(W.shape)
dZ = np.ones((10, 4, 4, 8))*2
# get the shapes: m = samples/images; H_dim = Height of image; W_dim = Width of image; 8 = Channels/filters
(m, H_dim, W_dim, C) = dZ.shape 
dA_prev = np.zeros((10, 4, 4, 3))
# add symmetric padding of 2 around height and width borders with 0-values; shape is now: (10, 8, 8, 3)
dA_prev = np.pad(dA_prev,((0,0),(2,2),(2,2),(0,0)), mode='constant', constant_values = (0,0)) 

# loop over images
for i in range(m):
    # loop over height
    for h in range(H_dim):
        # loop over width
        for w in range(W_dim):
            # loop over channels/filters
            for c in range(C):
                vert_start = 1 * h # 1 = stride, just as an example
                vert_end = vert_start + 2 # 2 = vertical filter size, just as an example
                horiz_start = 1 * w # 1 = stride
                horiz_end = horiz_start + 2 # 2 = horizontal filter size, just as an example
                dW[:,:,:,c] += dA_prev[i, vert_start:vert_end,horiz_start:horiz_end,:] * dZ[i, h, w, c]
                dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c] # dZ[i, h, w, c] is a scalar

doing backprop on the bias is easy enough (db = np.sum(dZ, axis=(0,1,2), keepdims=True)), and the weights can be dealt with using stride tricks and by reshaping the dZ and then using the dot product the rescaled input (or tensordot on the axes or einsum).

def _striding(array, stride_size, filter_shapes, Wout=None, Hout=None):
    strides = (array.strides[0], array.strides[1] * stride_size, array.strides[2] * stride_size, array.strides[1], array.strides[2], array.strides[3])
    strided = as_strided(array, shape=(array.shape[0], Hout, Wout, filter_shapes[0], filter_shapes[1], array.shape[3]), strides=strides, writeable=False)
    return strided

Hout = (A_prev.shape[1] - 2) // 1 + 1
Wout = (A_prev.shape[2] - 2) // 1 + 1
x_flat = _striding(array=A_prev, stride_size=2, filter_shapes=(2,2),
                     Wout=Wout, Hout=Hout).reshape(-1, 2 * 2 * A_prev.shape[3])
dout_descendant_flat = dout_descendant.reshape(-1, n_C)
dW = x_flat.T @ dout_descendant_flat # shape (fh * fw * n_prev_C, C)
dW = dW.reshape(fh, fw, n_prev_C, C)

this gives identical results as dW in the slow version. but doing something similar to get the derivative wrt to the input that should yield the same result, doesn't. here's what i've done:

dZ_pad = np.pad(dZ,((0,0),(2,2),(2,2),(0,0)), mode='constant', constant_values = (0,0)) # padding to get the same shape as A_prev
dZ_pad_reshaped = _striding(array=dZ_pad, stride_size=1, filter_shapes=(2,2),
                     Wout=4, Hout=4) # the Hout and Wout dims are from the unpadded dims of A_prev
Wrot180 = np.rot90(W, 2, axes=(0,1)) # the filter height and width are in the first two axes, which we want to rotate
dA_prev = np.tensordot(dZ_pad_reshaped, Wrot180, axes=([3,4,5],[0,1,3]))

the shapes of dA_prev are right, but for some reason the results aren't identical as the slow version

Why do you want to do back-propagation manually using Python while many tools and library efficiently implement that? (a Numpy vectorized code will not be as efficient as carefully tuned libraries). Assuming you have a good reason, note that the memory access pattern to `dW` is inefficient. Please consider using an axis inversion or a transposition. — Jérôme Richard, May 25 '22 at 18:48
Thanks for your comment. I'm aware I can just let TensorFlow or PyTorch do the work for me. But I am using this as a sort of learning project to figure out more about numpy and also about CNNs — jackewiebohne, May 25 '22 at 18:59
It looks like a convolution so you can use things like `np.convolve` or the equivalent of Scipy (the hundreds of past questions about convolutions in Numpy should help). — Jérôme Richard, May 25 '22 at 20:20
The thing with np.convolve is that it only allows convolutions with 2D arrays, mine are 4D. I've been trying to think if einsum could do it, but I'm not sure how. I'll have to look more closely at the scipy one tomorrow (though I'd still be curious about a solution in pure numpy in any case). — jackewiebohne, May 25 '22 at 21:15
The one of Scipy is less restrictive (see [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve.html)). There are many other way to do convolution. See Numpy stride tricks and ND FFT for example. — Jérôme Richard, May 25 '22 at 21:41
hm.. scipy is less restrictive, true, but since it expects two arrays of the same dimension (which mine aren't) and doesn't let you choose axes (that goes for convolve and correlate), it seems i'd still have to use loops (as far as I understand, there would still have to be 3 loops with my given dimensions) to get the right slices to convolve or correlate. — jackewiebohne, May 27 '22 at 12:02
Convolutions with a kernel having different array dimensions would be very weird to me. You can use a 1x1xN, 1xNx1 or Nx1x1 kernel to perform a 1D convolution along the different axis of a 3D array and the same thing applies for a 2D convolution on a 4D array. I see no reason it would not work and even if it would not, the FFT solution should be flexible enough to solve this problem. It can even deal with wide sparse kernels though it is only interesting for large kernels performance wise. — Jérôme Richard, May 27 '22 at 12:55
my goal wasn't to convolve on non-matching kernels, but rather to slice the different sized arrays in the right way so that (perfectly conventional) convolutions could take place. anyway, i've found a partial solution with stride tricks, but it's still not quite right. i've edited the question. the derivative wrt to W works (using stride tricks), but there must still be a mistake in how i use stride tricks to get the right derivative for the input. but i can't find it — jackewiebohne, Jun 04 '22 at 16:17

score 0 · Accepted Answer · answered Jun 09 '22 at 11:27

OK, turns out the error was to do with several things:

dZ needed to be dilated relative to the stride in the forward propagation
the window function used for windowing dZ (done after dilation of dZ) needed to be called with stride 1 (no matter the stride choice in the forward propagation) with the output heights and widths of the padded input (not the original, unpadded input -- this was the main mistake that took me days to debug)

the relevant code is below with comments explaining shapes and operations as well as some further sources for reading. i've also included the forward propagation. i should note that after days of debugging, writing various functions, reading etc. the variable names changed after a while, so for the ease of reading, here are the names of the variables as defined in my question and then their equivalent in the code below:

A_prev is x dZ is dout_descendant Hout is the height of dout_descendant Wout is the width of dout_descendant

(as one would expect all references to self are to the class these functions are part of)

    def _pad(self, array, pad_size, pad_val):
        '''
        only symmetric padding is implemented
        ''' 
        return np.pad(array, ((0, 0), (pad_size, pad_size), (pad_size, pad_size), (0, 0)), 'constant', constant_values=(pad_val, pad_val))


    def _dilate(self, array, stride_size, pad_size, symmetric_filter_shape, output_image_size):
        # on dilation for backprop with stride>1, 
        # see: https://medium.com/@mayank.utexas/backpropagation-for-convolution-with-strides-8137e4fc2710
        # see also: https://leimao.github.io/blog/Transposed-Convolution-As-Convolution/
        pad_bottom_and_right = (output_image_size + 2 * pad_size - symmetric_filter_shape) % stride_size
        for m in range(stride_size - 1):
            array = np.insert(array, range(1, array.shape[1], m + 1), 0, axis=1)
            array = np.insert(array, range(1, array.shape[2], m + 1), 0, axis=2)

        for _ in range(pad_bottom_and_right):
            array = np.insert(array, array.shape[1], 0, axis=1)
            array = np.insert(array, array.shape[2], 0, axis=2)
        return array


    def _windows(self, array, stride_size, filter_shapes, out_height, out_width):
        '''
        inputs:
            array to create windows of
            stride_size: int
            filter_shapes: tuple(int): tuple of filter height and width
            out_height and out_width: int, respectively: output sizes for the windows
        returns:
            windows of array with shape (excl. dilation): 
                array.shape[0], out_height, out_width, filter_shapes[0], filter_shapes[1], array.shape[3]
        '''
        strides = (array.strides[0], array.strides[1] * stride_size, array.strides[2] * stride_size, array.strides[1], array.strides[2], array.strides[3])
        return np.lib.stride_tricks.as_strided(array, shape=(array.shape[0], out_height, out_width, filter_shapes[0], filter_shapes[1], array.shape[3]), strides=strides, writeable=False)


    def forward(self, x):
        '''
        expects inputs to be of shape: [batchsize, height, width, channel in]
        after init, filter_shapes are: [fh, fw, channel in, channel out] 
        '''
        self.input_shape = x.shape
        x_pad = self._pad(x, self.pad_size, self.pad_val)
        self.input_pad_shape = x_pad.shape
        # get the shapes
        batch_size, h, w, Cin = self.input_shape
        # calculate output sizes; only symmetric padding is possible
        self.Hout = (h + 2*self.pad_size - self.fh) // self.stride + 1
        self.Wout = (w + 2*self.pad_size - self.fw) // self.stride + 1
        x_windows = self._windows(array=x_pad, stride_size=self.stride, filter_shapes=(self.fh, self.fw),
                        out_width=self.Wout, out_height=self.Hout) # 2D matrix with shape (batch_size, Hout, Wout, fh, fw, Cin)
        self.out = np.tensordot(x_windows, self.w, axes=([3,4,5], [0,1,2])) + self.b
        self.inputs = x_windows
        ## alternative 1: einsum approach, slower than other alternatives
        # self.out = np.einsum('noufvc,fvck->nouk', x_windows, self.w) + self.b
        ## alternative 2: column approach with simple dot product
        # z = x_windows.reshape(-1, self.fh * self.fw * Cin) @ self.W.reshape(self.fh*self.fw*Cin, Cout) + self.b # 2D matrix with shape (batch_size * Hout * Wout, Cout)
        # self.dout = z.reshape(batch_size, Hout, Wout, Cout)

    def backward(self,dout_descendant):
        '''
        dout_descendant has shape (batch_size, Hout, Wout, Cout)
        ''' 
        # get shapes
        batch_size, h, w, Cin = self.input_shape
        # we want to sum everything but the filters for b
        self.db = np.sum(dout_descendant, axis=(0,1,2), keepdims=True) # shape (1,1,1, Cout)
        # for dW we'll use the column approach with ordinary dot product for variety ;) tensordot does the same without all the reshaping
        dout_descendant_flat = dout_descendant.reshape(-1, self.Cout) # new shape (batch_size * Hout * Wout, Cout)
        x_flat = self.inputs.reshape(-1, self.fh * self.fw * Cin) # shape (batch_size * Hout * Wout, fh * fw * Cin)
        dw = x_flat.T @ dout_descendant_flat # shape (fh * fw * Cin, Cout)
        self.dw = dw.reshape(self.fh, self.fw, Cin, self.Cout)
        del dout_descendant_flat # free memory
        # for dinputs: we'll get padded and dilated windows of dout_descendant and perform the tensordot with 180 rotated W
        # for details, see https://medium.com/@mayank.utexas/backpropagation-for-convolution-with-strides-8137e4fc2710 ; also: https://pavisj.medium.com/convolutions-and-backpropagations-46026a8f5d2c ; also: https://youtu.be/Lakz2MoHy6o?t=835
        Wrot180 = np.rot90(self.w, 2, axes=(0,1)) # or also self.w[::-1, ::-1, :, :]
        # backprop for forward with stride > 1 is done on windowed dout that's padded and dilated with stride 1
        dout_descendant = self._dilate(dout_descendant, stride_size=self.stride, pad_size=self.pad_size, symmetric_filter_shape=self.fh, output_image_size=h)
        dout_descendant = self._pad(dout_descendant, pad_size=self.fw-1, pad_val=self.pad_val) # pad dout_descendant to dim: fh-1 (or fw-1); only symmetrical filters are supported
        dout_descendant = self._windows(array=dout_descendant, stride_size=1, filter_shapes=(self.fh, self.fw),
                            out_height=h + 2 * self.pad_size, out_width=w + 2 * self.pad_size) # shape: (batch_size * h_padded * w_padded, fh * fw * Cout)
        self.dout = np.tensordot(dout_descendant, Wrot180, axes=([3,4,5],[0,1,3]))
        self.dout = self.dout[:,self.pad_size:-self.pad_size, self.pad_size:-self.pad_size, :]
        ## einsum alternative, but slower:
        # dinput = np.einsum('nhwfvk,fvck->nhwc', dout_windows, self.W)

i've left this answer here, because all the other sources on stackoverflow or github i could find that used numpy stride tricks were implemented for convolutions of stride 1 (which doesn't require dilation of dZ) or they used very complex fancy indexing operations that were extremely hard to follow (e.g. https://sgugger.github.io/convolution-in-depth.html#convolution-in-depth or https://github.com/parasdahal/deepnet/blob/51a9e61c351138b7dc637f4b748a0e6ca2e15595/deepnet/im2col.py)

vectorized way to multiply and add specific axes in numpy array (convolutional layer backprop)

1 Answers1