
Background

I've trained up a convolutional neural network which I would like others to be able to use without requiring hard-to-install libraries such as Theano (which I have found trivial to install on Linux, but very hard on Windows).

I've written an implementation using Numpy/Scipy that is almost fast enough, but it would be even better if it were two or three times faster.

What I've tried

90% of the time is spent in the following line:

conv_out = np.sum([scipy.signal.convolve2d(x[i],W[f][i],mode='valid') for i in range(num_in)], axis=0)

This line gets called 32 times (once for each feature map), and num_in is 16 (the number of features in the previous layer). So overall this line is slow, as it results in 32*16 = 512 calls to the convolve2d routine.

x[i] is only 25*25, and W[f][i] is 2*2.

Question

Is there a better way of expressing this type of convolutional layer in Numpy/Scipy that would execute faster?

(I am only using this code for applying a learnt network so I do not have lots of images to be done in parallel.)

Code

The full code for doing a timing experiment is:

import numpy as np
import scipy.signal
from time import time

def max_pool(x):
    """Return maximum in groups of 2x2 for a N,h,w image"""
    N,h,w = x.shape
    return np.amax([x[:,(i>>1)&1::2,i&1::2] for i in range(4)],axis=0)

def conv_layer(params,x):
    """Applies a convolutional layer (W,b) followed by 2*2 pool followed by RelU on x"""
    W,biases = params
    num_in = W.shape[1]
    A = []
    for f,bias in enumerate(biases):
        conv_out = np.sum([scipy.signal.convolve2d(x[i],W[f][i],mode='valid') for i in range(num_in)], axis=0)
        A.append(conv_out + bias)
    x = np.array(A)
    x = max_pool(x)
    return np.maximum(x,0)

W = np.random.randn(32,16,2,2).astype(np.float32)
b = np.random.randn(32).astype(np.float32)
I = np.random.randn(16,25,25).astype(np.float32)

t0 = time()
O = conv_layer((W,b),I)
print(time() - t0)

This prints 0.084 seconds at the moment.

Update

Using mplf's suggestion:

d = x[:,:-1,:-1]
c = x[:,:-1,1:]
b = x[:,1:,:-1]
a = x[:,1:,1:]
for f,bias in enumerate(biases):
    conv_out = np.sum([a[i]*W[f,i,0,0]+b[i]*W[f,i,0,1]+c[i]*W[f,i,1,0]+d[i]*W[f,i,1,1] for i in range(num_in)], axis=0)

I get 0.075s, which is slightly faster.
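
(The inner list comprehension still loops in Python over the 16 input maps for each of the 32 filters, which presumably explains why the gain is modest; the accepted answer below removes both loops entirely.)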

Peter de Rivaz
  • +1 for an intriguing question however I'm seeing an average of 0.037s over 20 runs on this. What baseline speed are you aiming for? – mproffitt Aug 12 '15 at 18:56
  • I have 3 similarly sized networks I would like to run on a live camera input. It does actually work at the moment (at around 2 fps) but I would like to push it up - ideally to over 10 fps. Perhaps you have a faster computer? – Peter de Rivaz Aug 12 '15 at 18:59
  • I get `0.024` on the first run, and then `0.018-0.020` on a laptop i5 4200M, with python 3.4 under windows 10, with enough stuff running in the background to keep my CPU at 20%. What hardware / OS do you have? – IVlad Aug 12 '15 at 19:21
  • Fairly old laptop on windows 7, I am more interested in relative gains than absolute speed if possible – Peter de Rivaz Aug 12 '15 at 19:25
  • Can you give me some hints on the `(i>>1)&1::2, i&1::2` part? I can see the final effect is to sample 2x2 grids with a stride of 2, but don't quite get the bitwise part. How to generalize to a max of `nxn` with a stride of `s`? – Jason Mar 02 '18 at 07:37
  • @Jason For larger grids the equivalent is getting the y,x coordinates from divmod(i,n). However it would probably be clearer to simply use two for loops, one for x and one for y. – Peter de Rivaz Mar 02 '18 at 08:18
  • @PeterdeRivaz Got it. Just curious, how do you write the `n=3` case using the bitwise approach? – Jason Mar 02 '18 at 08:49
  • @Jason `i//3::2,i%3::2` should be equivalent, but I don't know if you can use bitwise operations to improve the speed further – Peter de Rivaz Mar 02 '18 at 09:49
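
For reference, here is the clearer two-for-loop form of the pooling mentioned in the comments above, generalized to an n x n window with stride s. This is a sketch of mine rather than code from the post; max_pool_general is a hypothetical name, and it assumes (h - n) and (w - n) are multiples of s so that all offset slices have the same output shape:

import numpy as np

def max_pool_general(x, n=2, s=2):
    """Max over n x n windows with stride s for an (N, h, w) stack (sketch)."""
    N, h, w = x.shape
    out_h = (h - n) // s + 1
    out_w = (w - n) // s + 1
    out = np.full((N, out_h, out_w), -np.inf, dtype=x.dtype)
    for dy in range(n):        # row offset within the window
        for dx in range(n):    # column offset within the window
            np.maximum(out, x[:, dy:dy + s * out_h:s, dx:dx + s * out_w:s], out=out)
    return out

With n=2 and s=2 this agrees with the original max_pool on even-sized inputs.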

2 Answers


Accelerating convolution

Building on mplf's suggestion, I've found it is possible to remove both of the for loops and the call to convolve2d:

d = x[:,:-1,:-1].swapaxes(0,1)
c = x[:,:-1,1:].swapaxes(0,1)
b = x[:,1:,:-1].swapaxes(0,1)
a = x[:,1:,1:].swapaxes(0,1)
x = W[:,:,0,0].dot(a) + W[:,:,0,1].dot(b) + W[:,:,1,0].dot(c) + W[:,:,1,1].dot(d) + biases.reshape(-1,1,1)

This is 10 times faster than the original code.
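
As a side note (my addition, not part of the original answer), the four .dot calls can be folded into a single np.einsum that contracts over both the input-channel axis and the four kernel taps at once; whether this is faster depends on the NumPy version, but it makes the pairing between slices and weights explicit:

import numpy as np

def conv_2x2_valid(x, W, biases):
    """One-einsum equivalent of the four .dot calls above (sketch).
    x: (num_in, h, w); W: (num_out, num_in, 2, 2); biases: (num_out,)."""
    # Stack the four shifted views in the order of W.reshape(num_out, num_in, 4):
    # W[f,i,0,0] pairs with x[:,1:,1:] and so on (the pairing is reversed
    # relative to the kernel because 'valid' convolution flips the kernel).
    S = np.stack([x[:, 1:, 1:], x[:, 1:, :-1],
                  x[:, :-1, 1:], x[:, :-1, :-1]])
    num_out, num_in = W.shape[:2]
    out = np.einsum('fis,siyx->fyx', W.reshape(num_out, num_in, 4), S)
    return out + biases[:, None, None]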

Accelerating max pool

With this new code, the max pool stage now takes 50% of the time. This can also be sped up by using:

def max_pool(x):
    """Return maximum in groups of 2x2 for a N,h,w image"""
    N,h,w = x.shape
    x = x.reshape(N,h//2,2,w//2,2).swapaxes(2,3).reshape(N,h//2,w//2,4)
    return np.amax(x,axis=3)

This speeds up the max_pool step by a factor of 10, so overall the program doubles in speed again.
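
A quick sanity check (my addition) that the reshaped pooling agrees with the original slicing version:

import numpy as np

x = np.random.randn(32, 24, 24).astype(np.float32)
slow = np.amax([x[:, (i >> 1) & 1::2, i & 1::2] for i in range(4)], axis=0)
N, h, w = x.shape
fast = x.reshape(N, h // 2, 2, w // 2, 2).swapaxes(2, 3).reshape(N, h // 2, w // 2, 4).max(axis=3)
assert np.allclose(slow, fast)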

Peter de Rivaz
  • I was about to suggest an equivalent generalization: `np.sum(a*W[f,:,0,0][...,None,None]+b*W[f,:,0,1][...,None,None]+c*W[f,:,1,0][...,None,None]+d*W[f,:,1,1][...,None,None], axis=0) ` – hpaulj Aug 12 '15 at 22:04
  • An equivalent formulation of `max_pool` is the following which only needs one `reshape` and no `swapaxes`: `return x.reshape(N, h / 2, 2, w / 2, 2).max(axis=(2, 4))` – jdoerrie Jun 16 '16 at 12:57

Looking around, it seems that the scipy convolve2d function is unoptimized and rather inefficient. There is an open issue on this from Jan 2014 (https://github.com/scipy/scipy/issues/3184), and this question seems related: Improving Numpy Performance.

I would suggest first trying the solution posted by Theran there and seeing if it produces better performance.
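
Theran's specialconvolve in that linked answer is written for one specific kernel, but the underlying idea, replacing convolve2d with a sum of shifted, weighted slices, generalizes to any small kernel. A rough sketch of mine (not code from the linked post):

import numpy as np

def conv2d_valid_small(img, kernel):
    """'valid' 2-D convolution as a sum of shifted slices (sketch).
    Intended to match scipy.signal.convolve2d(img, kernel, mode='valid');
    practical when the kernel is small so the k*k Python loop is cheap."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    k = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros((oh, ow), dtype=np.result_type(img, kernel))
    for u in range(kh):
        for v in range(kw):
            out += k[u, v] * img[u:u + oh, v:v + ow]
    return out

For 2x2 kernels this is exactly the decomposition used in the question's update above.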

mproffitt
  • Thanks I'll try it out in the morning – Peter de Rivaz Aug 12 '15 at 19:30
  • I assume this method would still involve doing around 512*4 matrix operations - do you think there is a way of reducing these to a smaller number of larger matrix operations which I would expect to be faster? – Peter de Rivaz Aug 12 '15 at 19:38
  • The `specialconvolve` expression is all linear operations, so it should be easy to generalize to include the `num_in` dimension. If needed, ask about it in another question. – hpaulj Aug 12 '15 at 21:31
  • But `specialconvolve` is designed for a particular kernel, so how do you use it for kernels of other sizes and values? – Jason Mar 02 '18 at 06:47