PyCUDA GPUArray slice-based operations

Question

The PyCUDA documentation is a bit light on examples for those of us in the 'Non-Guru' class, so I'm wondering which array operations are available on GPUArrays, i.e. how I would port this loop to GPUArray:

m=np.random.random((K,N,N))
a=np.zeros_like(m)
b=np.random.random(N) #example
for k in range(K):
    for x in range(N):
        for y in range(N):
            a[k,x,y]=m[k,x,y]*b[y]

The usual first-stop Python reduction for this would be something like:

for k in range(K):
    for x in range(N):
        a[k,x,:]=m[k,x,:]*b
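
For reference, on the host the two remaining loops can also be dropped with NumPy broadcasting. This is only a minimal sketch of the CPU-side operation being asked about; the sizes `K = 4` and `N = 8` are illustrative assumptions, not values from the question.

```python
import numpy as np

K, N = 4, 8  # illustrative sizes (assumption, not from the question)
m = np.random.random((K, N, N))
b = np.random.random(N)

# b has shape (N,), so broadcasting lines it up with the last (y) axis of m:
# a[k, x, y] == m[k, x, y] * b[y]
a = m * b
```

The question is then how to get the same per-row scaling on the device.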

But I can't see any simple way to do this with GPUArray other than writing a custom elementwise kernel. Even then, with this problem there would have to be looping constructs in the kernel, and at that level of complexity I'm probably better off just writing my own full-blown SourceModule kernel.

Can anyone clue me in?

score 2 · Answer 1 · answered Aug 01 '13 at 13:50

You can also use the memcpy_dtod() function together with the slicing functionality of GPUArrays. It's strange that normal assignment does not work. set() does not work because it assumes a host-to-device transfer (using memcpy_htod()).

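    import pycuda.driver

    # a, m and b are GPUArrays; each slice product is copied device-to-device into a
    for k in range(K):
        for x in range(N):
            pycuda.driver.memcpy_dtod(a[k,x,:].gpudata, (m[k,x,:]*b).gpudata, a[k,x,:].nbytes)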

score 2 · Accepted Answer · answered Apr 18 '11 at 20:23

That is probably best done with your own kernel. While PyCUDA's gpuarray class is a really convenient abstraction of GPU memory into something which can be used interchangeably with numpy arrays, there is no getting around the need to code for the GPU for anything outside of the canned linear algebra and parallel reduction operations.

That said, it is a pretty trivial little kernel to write. So trivial that it would be memory-bandwidth bound, so you might want to see if you can "fuse" a few similar operations together to improve the ratio of FLOPs to memory transactions a bit.

If you need some help with the kernel, drop in a comment, and I can expand the answer to include a rough prototype.
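
As a rough prototype of the kind mentioned above, one possible sketch uses PyCUDA's `ElementwiseKernel`, which exposes the flat element index as `i`. For a C-contiguous array of shape (K, N, N), `i % N` is the y index, so no explicit loop is needed in the operation string. The function names, the `row_len` argument, and the float32 conversion are assumptions made for illustration. The extra argument is deliberately not called `n`, since the generated kernel appears to use that name for the element count. The GPU path is untested here and needs a CUDA device; the CPU function only models the index arithmetic.

```python
import numpy as np

def scale_last_axis_reference(m, b):
    """CPU model of the kernel body a[i] = m[i] * b[i % row_len] on the flat array."""
    row_len = b.shape[0]
    flat = np.ascontiguousarray(m).ravel()
    i = np.arange(flat.size)
    return (flat * b[i % row_len]).reshape(m.shape)

def scale_last_axis_gpu(m, b):
    """Same operation on the GPU (hypothetical helper; requires PyCUDA and a CUDA device)."""
    # Imported lazily so the CPU reference above still works without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a context on the default device)
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    kernel = ElementwiseKernel(
        "float *a, float *m, float *b, int row_len",
        "a[i] = m[i] * b[i % row_len]",
        "scale_last_axis",
    )
    # Assumes float32, C-ordered data; adjust the C types for double precision.
    m_gpu = gpuarray.to_gpu(np.ascontiguousarray(m, dtype=np.float32))
    b_gpu = gpuarray.to_gpu(np.ascontiguousarray(b, dtype=np.float32))
    a_gpu = gpuarray.empty_like(m_gpu)
    kernel(a_gpu, m_gpu, b_gpu, np.int32(b.shape[0]))
    return a_gpu.get()
```

Fusing further elementwise steps would then amount to extending the operation string, so that each element is read and written only once.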

talonmies

I always appreciate guidance, so I'd be really interested to see someone else's take on it. My little experimental implementation is a bit lazy: it maps the k's to blockIdx.x, and x and y to their respective thread IDs (N is never going to be anywhere near 512, so I'm OK with that), with the relevant block and grid dimensions. If I'm on the right track then ignore this, but any additional insight would be great. – Bolster Apr 18 '11 at 20:54
With row-major ordered arrays, the y-dimension should be the dimension where reads need to be coalesced for maximum throughput. So I would do the inner loop within a block, and unroll the outer dimensions over the grid. – talonmies Apr 18 '11 at 21:12
So (just to be sure), the k and x values are the thread IDs and the y's are the block IDs? – Bolster Apr 18 '11 at 21:22
No, the other way. By having threads within the same block (and warps within the same block) read along the y (third) dimension, memory reads should be coalesced for row-major ordered storage. This sort of operation is completely bandwidth bound, so optimizing memory access is key to getting reasonable performance. – talonmies Apr 18 '11 at 21:30
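
The mapping described in these comments could be sketched roughly as follows: y runs over `threadIdx.x`, while x and k are spread over the grid. This is not code from the thread. The kernel name, `launch_config`, and `scale_last_axis_gpu` are hypothetical, the data is assumed to be float32 in C order, and N is assumed to fit within the per-block thread limit, as stated earlier. The GPU path is untested here and requires PyCUDA and a CUDA device.

```python
import numpy as np

KERNEL_SRC = r"""
__global__ void scale_last_axis(float *a, const float *m, const float *b, int n)
{
    int y = threadIdx.x;   // fastest-varying index: adjacent threads touch adjacent floats
    int x = blockIdx.x;
    int k = blockIdx.y;
    if (y < n) {
        int idx = (k * n + x) * n + y;   // row-major offset of element [k, x, y]
        a[idx] = m[idx] * b[y];
    }
}
"""

def launch_config(K, N):
    """Return (block, grid): y over the threads of a block, x and k over the grid."""
    return (N, 1, 1), (N, K)

def scale_last_axis_gpu(m, b):
    """Run the kernel on a (K, N, N) array m and a length-N vector b."""
    # Imported lazily so launch_config can be inspected without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a context on the default device)
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    K, N, _ = m.shape
    func = SourceModule(KERNEL_SRC).get_function("scale_last_axis")
    m_gpu = gpuarray.to_gpu(np.ascontiguousarray(m, dtype=np.float32))
    b_gpu = gpuarray.to_gpu(np.ascontiguousarray(b, dtype=np.float32))
    a_gpu = gpuarray.empty_like(m_gpu)
    block, grid = launch_config(K, N)
    func(a_gpu.gpudata, m_gpu.gpudata, b_gpu.gpudata, np.int32(N),
         block=block, grid=grid)
    return a_gpu.get()
```

With this layout each block handles one (k, x) row, so the threads of a warp read consecutive addresses along y.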
