7

Here's some Python code that implements a sliding-window computation on two 3D matrices, X and Y.

import numpy

def sliding_dot(X, Y):
    # X is the (iw, ih, id) source array, Y the (fw, fh, fd) window; id == fd.
    assert X.ndim == Y.ndim == 3
    iw, ih, id = X.shape
    fw, fh, fd = Y.shape

    assert id == fd
    assert fw < iw and fh < ih

    ow, oh = iw - fw + 1, ih - fh + 1
    out = numpy.zeros([ow, oh])

    for x in range(ow):
        for y in range(oh):
            window = X[x:x + fw, y:y + fh, :]
            out[x, y] = numpy.dot(window.flatten(), Y.flatten())

    return out

#################    

A_dims = (640,480,32)
B_dims = (6,6,32)

A = numpy.random.rand(*A_dims)
B = numpy.random.rand(*B_dims)

sliding_dot(A,B)

In general, Y is always much smaller than X along the first and second dimensions, but they are equal in the third dimension.

Note that we could replace numpy.dot() with any function of Y and the window. This is a little different from convolution in that Y only slides along the first and second dimensions of X. I'm looking for an effective strategy for implementing this kind of sliding-window computation efficiently in CUDA. Anybody want to offer me some direction? Cheers!

Update: You can watch me work through the optimization process, with help from other users, in my answer below.

BrianTheLion

4 Answers

4

Trying to design a "generalised" implementation that could accommodate just about any operation you might want is going to be an enormous trade-off in an architecture like CUDA. For your concrete dot-product example, which is a typical reduction operation, this is a pretty useful implementation:

__constant__ int ldaX[3];
__constant__ int ldaY[3];
__constant__ int dimX[3];
__constant__ int dimY[3];

template<typename real,int blocksize>
__global__ void sliding_k(const real *X, const real *Y, real *out)
{
    __shared__ volatile real buffer[blocksize];

    int tid = threadIdx.x;
    int gid = blockIdx.x * gridDim.y + blockIdx.y;

    real value = (real)0;
    int xpos = (blockIdx.y * ldaX[2]) + (blockIdx.x * ldaX[1]);
    int ypos = 0;
    for(int i=0; i<dimY[0]; i++) {
        for(int jk=tid; jk<ldaY[1]; jk+=blocksize) {
            value += X[xpos+jk] * Y[ypos+jk];
        }
        xpos += ldaX[1];
        ypos += ldaY[1];
    }

    buffer[tid] = value;
    __syncthreads();

#pragma unroll
    for(int i=(tid+32); ((tid<32)&&(i<blocksize)); i+=32)
        buffer[tid] += buffer[i];

    if (tid < 16) buffer[tid] += buffer[tid + 16];
    if (tid < 8)  buffer[tid] += buffer[tid + 8];
    if (tid < 4)  buffer[tid] += buffer[tid + 4];
    if (tid < 2)  buffer[tid] += buffer[tid + 2];
    if (tid == 0) out[gid] = buffer[0] + buffer[1];
}

You could substitute any kind of reduction operator you like for the floating-point multiply-add/summation operation which a dot product uses, and the code should work OK. Each window calculation is performed by a single block. At this window size there is enough parallel work to justify a block per window, and this allows coalesced global memory access and, on Fermi cards, a good number of L1 cache hits.
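As an illustration, a sliding sum of absolute differences instead of a dot product would only change the accumulation statement inside the jk loop; the shared-memory reduction at the end stays the same because the combining operator is still addition:

    for(int i=0; i<dimY[0]; i++) {
        for(int jk=tid; jk<ldaY[1]; jk+=blocksize) {
            value += fabs(X[xpos+jk] - Y[ypos+jk]); // was: X[xpos+jk] * Y[ypos+jk]
        }
        xpos += ldaX[1];
        ypos += ldaY[1];
    }

(fabs here assumes the double-precision instantiation; a float version would use fabsf.)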

Here I have built only one assumption into the code: that the third dimension of the source array and the window array are equal. This allows the inner two loops to be "fused" into a single operation because of the common memory layout they share. Running a test harness in Python using an improved version of your reference code, with the host code written in PyCUDA, I get this:

In [15]: %timeit -n3 -r3 out2=sliding_cuda(A,B)
3 loops, best of 3: 49.8 ms per loop

In [16]: %timeit -n3 -r3 out=sliding_dot(A,B)
3 loops, best of 3: 2.18 s per loop

In [17]: (numpy.abs(out2-out)/numpy.abs(out)).max()
Out[17]: 4.2921323635558404e-15

when run on a 3GHz Phenom II with a GTX470, using 64-thread blocks on a 635x475 2D grid, i.e. roughly a 50x speed-up including module loading, setup and memory transfers using pageable host memory allocations. The kernel itself is about 100 times faster than the Python version, not counting memory transfers and setup overhead. Note that this is a double-precision version: Python uses double-precision floating-point arithmetic by default.
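For reference, a rough host-side setup in plain CUDA C that should be equivalent to the PyCUDA harness might look like the sketch below. The PyCUDA host code itself isn't shown here, so the constant-memory values are inferred from the kernel's indexing, the kernel is assumed to live in the same translation unit, and error checking and the host/device copies of the array contents are elided.

#include <cuda_runtime.h>

int main(void)
{
    const int iw = 640, ih = 480, d = 32;        // X dimensions
    const int fw = 6,   fh = 6;                  // Y window (depth is also d)
    const int ow = iw - fw + 1, oh = ih - fh + 1;
    const int blocksize = 64;

    // Element strides: index 1 = stride of the first dimension,
    // index 2 = stride of the second; index 0 is unused by the kernel.
    int h_ldaX[3] = { iw*ih*d, ih*d, d };
    int h_ldaY[3] = { fw*fh*d, fh*d, d };
    int h_dimX[3] = { iw, ih, d };
    int h_dimY[3] = { fw, fh, d };
    cudaMemcpyToSymbol(ldaX, h_ldaX, sizeof(h_ldaX));
    cudaMemcpyToSymbol(ldaY, h_ldaY, sizeof(h_ldaY));
    cudaMemcpyToSymbol(dimX, h_dimX, sizeof(h_dimX));
    cudaMemcpyToSymbol(dimY, h_dimY, sizeof(h_dimY));

    double *d_X, *d_Y, *d_out;
    cudaMalloc(&d_X,   sizeof(double)*iw*ih*d);
    cudaMalloc(&d_Y,   sizeof(double)*fw*fh*d);
    cudaMalloc(&d_out, sizeof(double)*ow*oh);
    // ... cudaMemcpy the contents of A and B into d_X and d_Y here ...

    dim3 grid(ow, oh);                           // one block per output element
    dim3 block(blocksize, 1, 1);
    sliding_k<double, blocksize><<<grid, block>>>(d_X, d_Y, d_out);

    // ... cudaMemcpy d_out back and compare with the NumPy reference ...
    cudaFree(d_X); cudaFree(d_Y); cudaFree(d_out);
    return 0;
}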

talonmies
  • Thanks for posting! Sorry I haven't had a chance to evaluate your solution yet. Just curious about why you didn't go with a texture-based implementation. – BrianTheLion Oct 21 '11 at 22:53
  • Only because I doubt there will be much performance improvement in doing so. My block-based version has fully coalesced reads of both the main matrix and the window matrix, which is faster than reading via textures randomly, and Fermi L1 cache is larger than the texture cache, so hit rates are probably just as high. My experience with other matrix operations showed binding to textures wasn't faster. – talonmies Oct 22 '11 at 03:36
1

Well, here are some thoughts:

You perform ~640*480 iterations of numpy.dot, each of which processes 6*6*32 elements. Parallelizing the dot product is barely worth it: 192 parallel threads is not enough for the GPU, and a reduction on CUDA is additional trouble. So, IMO, the best way to parallelize your task is to assign one element of the output array to each thread.

Now about memory: the output array will be in global memory; there is not much choice there. For the input data, A looks quite good for texture memory, since adjacent threads access adjacent elements. Alternatively, you can manually "cache" it in shared memory, but in this case that does not look much more advantageous than simply using texture. For B, shared memory is not good, since it would cause bank conflicts: when you calculate the dot product, all threads in a half-warp access the same element of B (you could start the summation from different elements in different threads, but that (again) doesn't look promising). So the choice is either texture or constant. I vote for constant, since (a) constant memory is suited for data which is accessed by all threads on the device, and (b) you won't pollute the texture cache.

The above are just my guesses; to actually achieve good performance you'd better try out different variants...

Update regarding your naive implementation

for (int Yi = 0; Yi < Ydims[0]; Yi++ )

Here you access global memory on each iteration. That's a huge performance killer. Since you have 3 dimensions, you'd better replace your int *Ydims with an int3 Ydims (and likewise for Xdims and outdims).

out[out_indx] += X[X_indx]*Y[Y_indx];

Again, a very bad idea. Create a register variable and do all operations with it, then write to the global array only once at the end of the kernel.

These optimizations are the first thing you should do. The second is to make your X and Y 3D textures, so that accesses to them are cached. I guess that after this, CUDA will outperform the CPU.

For further optimisation, you'd better read the CUDA C Best Practices Guide. It's a must-read, and you will get a much better idea of how to write efficient GPU code (right now your implementation is too naive).

aland
  • Thanks! Tried your suggestion and mapped each output pixel to a single thread. Haven't attempted to do any memory optimization. The results are mixed so far. – BrianTheLion Oct 08 '11 at 19:48
  • Wow, awesome help! From what I can tell, kernel parameters are stored in local memory and local memory is off-chip. Is there any way I can get outdims, Xdims, and Ydims to on-chip memory? – BrianTheLion Oct 08 '11 at 23:14
  • @BrianTheLion Nope, the kernel parameters are stored in on-chip shared memory, which is usually nearly as fast as registers. You might be confusing OpenCL'ish local memory, which is the same as CUDA'ish shared, and CUDA'ish local, which is actually just a part of off-chip global memory. – aland Oct 09 '11 at 18:52
  • Cool. I'm now guessing that my v0.2 performance is due to the fact that I'm using 1D textures and hence am not getting the benefit of 2D-optimized caching. – BrianTheLion Oct 09 '11 at 22:31
0

v0.1 - Naive implementation

Here's my first, naive attempt at making this work:

__global__ void sliding_dot(float *out, int *outdims, float *X, int *Xdims, float *Y, int *Ydims )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    int Y_indx = 0;
    int X_indx = 0;
    if ( i < outdims[0] && j < outdims[1] )
    {
        int out_indx = j + i*outdims[1];
        for (int Yi = 0; Yi < Ydims[0]; Yi++ )
        {
            for (int Yj = 0; Yj < Ydims[1]; Yj++ )
            {
                for (int k = 0; k < Ydims[2]; k++ )
                {
                    Y_indx = k + Yj*    Ydims[2] + Yi*    Ydims[2]*Ydims[1];
                    X_indx = k + (j+Yj)*Xdims[2] + (i+Yi)*Xdims[2]*Xdims[1];
                    out[out_indx] += X[X_indx]*Y[Y_indx];
                }
            }
        }
    }
}

So far the results are less than desirable. With block size (32,32,1) and grid dimensions p,q chosen such that p*32 >= outdims[0] and q*32 >= outdims[1]:

method=[ sliding_dot ] gputime=[ 7013.280 ] cputime=[ 18.000 ] occupancy=[ 0.667 ] 
method=[ sliding_dot ] gputime=[ 6945.184 ] cputime=[ 7.000 ] occupancy=[ 0.667 ] 
method=[ sliding_dot ] gputime=[ 6990.816 ] cputime=[ 6.000 ] occupancy=[ 0.667 ] 
method=[ sliding_dot ] gputime=[ 6931.648 ] cputime=[ 6.000 ] occupancy=[ 0.667 ] 
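For reference, that launch configuration amounts to something like the following host-side snippet (a hypothetical helper; the h_outdims array and the d_-prefixed device pointers are assumed to have been allocated and filled elsewhere):

// Launch helper matching the configuration above: a (32,32,1) block and grid
// dimensions p,q rounded up so that p*32 >= outdims[0] and q*32 >= outdims[1].
void launch_sliding_dot(float *d_out, int *d_outdims, const int h_outdims[2],
                        float *d_X, int *d_Xdims, float *d_Y, int *d_Ydims)
{
    dim3 block(32, 32, 1);
    dim3 grid((h_outdims[0] + block.x - 1) / block.x,
              (h_outdims[1] + block.y - 1) / block.y);
    sliding_dot<<<grid, block>>>(d_out, d_outdims, d_X, d_Xdims, d_Y, d_Ydims);
}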

v0.2 - texture<float,1>

I hope everybody is learning as much from this as I am! I followed @aland's suggestions and got a considerable speed-up:

texture<float,1> X;
texture<float,1> Y;

__global__ void dotconv(float *out, int2 outdims, int3 Xdims, int3 Ydims )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if ( i < outdims.x && j < outdims.y )
    {
        int out_indx = j + i*outdims.y;
        float total = 0.0f;
        int X_indx = 0;
        int Y_indx = 0;
        for (int Yi=0; Yi<Ydims.x; Yi++ )
        {
            for (int Yj=0; Yj<Ydims.y; Yj++ )
            {
                for (int k=0; k<Ydims.z; k++ )
                {
                    Y_indx = k + Yj*    Ydims.z + Yi*    Ydims.z*Ydims.y;
                    X_indx = k + (j+Yj)*Xdims.z + (i+Yi)*Xdims.z*Xdims.y;
                    total += tex1Dfetch(X,X_indx)*tex1Dfetch(Y,Y_indx);
                }
            }
        }
        out[out_indx] = total;
    }
}

But we're still not running as quickly as the CPU:

method=[ dotconv ] gputime=[ 2224.928 ] cputime=[ 24.000 ] occupancy=[ 0.667 ] 
method=[ dotconv ] gputime=[ 2222.592 ] cputime=[ 7.000 ] occupancy=[ 0.667 ] 
method=[ dotconv ] gputime=[ 2225.216 ] cputime=[ 10.000 ] occupancy=[ 0.667 ] 
method=[ dotconv ] gputime=[ 2222.752 ] cputime=[ 10.000 ] occupancy=[ 0.667 ] 

v0.3 - texture<float,3>

Next I tried 3D textures, hoping to benefit from texture caching along more than one dimension:

texture<float,3,cudaReadModeElementType> X;
texture<float,3,cudaReadModeElementType> Y;

__global__ void dotconv(float *out, int2 outdims, int3 Xdims, int3 Ydims )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    if ( i < outdims.x && j < outdims.y )
    {
        int out_indx = j + i*outdims.y;
        float total = 0.0f;
        for (int Yi=0; Yi<Ydims.x; Yi++ )
        {
            for (int Yj=0; Yj<Ydims.y; Yj++ )
            {
                for (int k=0; k<Ydims.z; k++ )
                {
                    total += tex3D(X,k,j+Yj,i+Yi) * tex3D(Y,k,Yj,Yi);   
                }
            }
        }
        out[out_indx] = total;
    }
}

This is actually a little slower than v0.2:

method=[ dotconv ] gputime=[ 2403.360 ] cputime=[ 35.000 ] occupancy=[ 0.667 ] 
method=[ dotconv ] gputime=[ 2392.160 ] cputime=[ 15.000 ] occupancy=[ 0.667 ] 
method=[ dotconv ] gputime=[ 2396.448 ] cputime=[ 15.000 ] occupancy=[ 0.667 ] 
method=[ dotconv ] gputime=[ 2398.880 ] cputime=[ 16.000 ] occupancy=[ 0.667 ] 

Thanks for your suggestions!

BrianTheLion
  • There is a lot of "low hanging fruit" in your fastest v0.2 version. You are currently performing *14* integer operations for every fmad in the dot product inner loop. That is an enormous overhead, and at least 12 of the 14 iops are redundant. – talonmies Oct 12 '11 at 07:59
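One way to act on that comment is to hoist the index arithmetic out of the k loop, as in this sketch (a hypothetical variant of v0.2, not code from the thread): Y is traversed contiguously, so its flat offset can simply be incremented, and the offset into X only needs to be rebuilt once per window row.

texture<float,1> X;
texture<float,1> Y;

__global__ void dotconv_hoisted(float *out, int2 outdims, int3 Xdims, int3 Ydims )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if ( i < outdims.x && j < outdims.y )
    {
        float total = 0.0f;
        int Y_indx = 0;                    // Y is walked contiguously
        for (int Yi=0; Yi<Ydims.x; Yi++ )
        {
            // offset of X[i+Yi][j][0]; stepped by Xdims.z for each Yj below
            int X_row = (i+Yi)*Xdims.z*Xdims.y + j*Xdims.z;
            for (int Yj=0; Yj<Ydims.y; Yj++ )
            {
                int X_indx = X_row;
                for (int k=0; k<Ydims.z; k++ )
                    total += tex1Dfetch(X,X_indx++) * tex1Dfetch(Y,Y_indx++);
                X_row += Xdims.z;
            }
        }
        out[j + i*outdims.y] = total;
    }
}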
0

You might want to try separating out your reads from your sums from your stores.

So each kernel should have 3 sections:

  1. Read from texture memory, store to shared memory for the entire block (see the sketch after this list)

    __shared__ float blockX[ Ydims.z ][ Ydims.y ][ Ydims.x ];
    __shared__ float blockY[ Ydims.z ][ Ydims.y ][ Ydims.x ];
    // NOTE: make each thread load k elements * 2 cooperatively, rather than
    // having each thread load all Ydims.X*Y*Z elements
    blockX[k][yj][yi] = ...
    blockY[k][yj][yi] = ...
    __syncthreads(); // <-- critical -- all threads in the block must finish
    // writing to shared memory before any may read the values.
    
  2. #pragma unroll your for loops.
    This will significantly increase your ILP and result in much less branching for your constant loop sizes

  3. Ensure that your shared memory access is strided appropriately, otherwise bank conflicts will kill your performance.
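A minimal sketch of section 1, built on the v0.2 texture kernel (the kernel name dotconv_shY, the Yglob argument and the decision to stage only Y in shared memory are assumptions, not part of the answer above): Y is small and shared by every thread in the block, so it is loaded cooperatively into dynamically sized shared memory once per block. Staging the per-window slice of X as well would need either a block-per-window mapping or halo handling, which is omitted here, and the third launch parameter must be Ydims.x*Ydims.y*Ydims.z*sizeof(float).

texture<float,1> X;    // X stays bound to a 1D texture, as in v0.2

__global__ void dotconv_shY(float *out, int2 outdims, int3 Xdims, int3 Ydims,
                            const float *Yglob)
{
    extern __shared__ float blockY[];  // Ydims.x*Ydims.y*Ydims.z floats

    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;
    int ysize = Ydims.x * Ydims.y * Ydims.z;

    // Section 1: cooperative load -- each thread copies a strided subset of Y.
    for (int idx = tid; idx < ysize; idx += nthreads)
        blockY[idx] = Yglob[idx];
    __syncthreads();   // all writes to shared memory must finish before reads

    // Section 2: the sums; reads of Y now come from shared memory.  With
    // compile-time window dimensions these loops could also be unrolled.
    if ( i < outdims.x && j < outdims.y )
    {
        float total = 0.0f;
        for (int Yi=0; Yi<Ydims.x; Yi++ )
            for (int Yj=0; Yj<Ydims.y; Yj++ )
                for (int k=0; k<Ydims.z; k++ )
                {
                    int Y_indx = k + Yj*Ydims.z + Yi*Ydims.z*Ydims.y;
                    int X_indx = k + (j+Yj)*Xdims.z + (i+Yi)*Xdims.z*Xdims.y;
                    total += tex1Dfetch(X,X_indx) * blockY[Y_indx];
                }

        // Section 3: a single store to global memory per thread.
        out[j + i*outdims.y] = total;
    }
}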

BrianTheLion
mike_f
  • Thanks! The shared memory optimization is what I've been working on this morning. We should know the outcome here shortly. – BrianTheLion Oct 10 '11 at 18:37