The first issue is that your timing metrics are not accurate.
linalg compiles CUDA modules on the fly, so the first call pays the cost of that compilation while you are timing it. I made some slight modifications to your code to reduce the size of the arrays being multiplied, but regardless, after two runs with no other changes I saw a massive gain in performance, e.g.:
Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633
vs
Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
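One way to take that compilation out of the measurement entirely is a warm-up call before you start the clock. A minimal, untested sketch (it assumes import pycuda.autoinit set up the context and that zzg and xxg are already GPUArrays):

# warm-up: the first multiply triggers the on-the-fly compilation
_ = linalg.multiply(zzg, xxg)
pycuda.autoinit.context.synchronize()   # wait for the GPU before touching the clock

t1 = time.time()
zz_Hypg = linalg.multiply(zzg, xxg)
pycuda.autoinit.context.synchronize()   # and again, so t2 covers the whole kernel
t2 = time.time()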
However, that is still much slower than the CPU version. The next thing I did was time only where the actual computation happens. You aren't tiling in your numpy version, so don't time the tiling in your CUDA version:
REAL Time taken with GPU:0.6461708545684814
You also copy to the GPU and include that in the measurement, but the copy itself takes a non-trivial amount of time, so let's remove that too:
t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992
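So the timed region ends up containing just the multiply, with the tiling, the copies, and the warm-up done beforehand. Roughly (untested; tile_reps is just a placeholder for whatever repetitions you were tiling with):

zzg = gpuarray.to_gpu(np.tile(zz0, tile_reps))   # tiling + copy happen before the clock starts
xxg = gpuarray.to_gpu(xx0)
_ = linalg.multiply(zzg, xxg)                    # warm-up, as above

t1 = time.time()
zz_Hypg = linalg.multiply(zzg, xxg)              # only the element-wise multiply is timed
pycuda.autoinit.context.synchronize()
t2 = time.time()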
Wow, removing the copy contributed a lot. But we're still slower than the numpy version. Why?
Remember when I said that numpy doesn't tile? It doesn't copy memory at all for broadcasting (see the small example after the list below). To get the real speed, you would have to:
- not tile
- broadcast dimensions
- implement this in a kernel.
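As a small aside, you can see that numpy really doesn't copy anything when broadcasting by looking at the strides of a broadcast view (nothing here is specific to your code):

import numpy as np

a = np.arange(4.0).reshape(4, 1)        # shape (4, 1)
b = np.broadcast_to(a, (4, 1000000))    # shape (4, 1000000)
print(b.strides)                        # (8, 0): the broadcast axis has stride 0,
                                        # so no new memory was allocated

An element-wise multiply between arrays with compatible shapes does the same thing internally, which is why there is no tiling cost to subtract on the numpy side.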
PyCUDA provides the utilities for implementing your own kernels, but its GPUArray does not provide broadcasting. Essentially what you would have to do is something like this (DISCLAIMER: I haven't tested this, there are probably bugs, this is just to demonstrate approximately what the kernel should look like):
#include <pycuda-complex.hpp>
//KERNEL CODE

//edge length of the square output tile each thread block covers
constexpr unsigned work_tile_dim = 32;
//instruction level parallelism factor: how many output rows each thread handles.
//It may be changed, but it affects the launch dimensions; the thread block size
//should be (work_tile_dim, work_tile_dim / ilp_factor).
constexpr unsigned ilp_factor = 4;

//assuming C order:
//  x axis contiguous in out,
//  y axis contiguous in zz,
//  x axis contiguous in xx
//using __restrict__ because we know all pointers refer to different parts of memory.
__global__
void element_wise_multiplication(
        const pycuda::complex<double>* __restrict__ array_zz,
        const pycuda::complex<double>* __restrict__ array_xx,
        pycuda::complex<double>* __restrict__ out_array,
        unsigned array_zz_w,    /*size of the w dimension of zz (not needed below, kept for clarity)*/
        unsigned array_zz_z,    /*size of the z dimension of zz*/
        unsigned array_zz_xx_y, /*size of the y dimension, shared by zz and xx*/
        unsigned array_xx_x){   /*size of the x dimension of xx*/
    //the z dimension of a grid often has fairly small size limits and can cause
    //performance issues on older cards, so we derive the x and y indices from the
    //x and y block/thread indices and fold both z and w into blockIdx.z.
    unsigned x_idx = blockIdx.x * work_tile_dim + threadIdx.x;
    unsigned y_idx = blockIdx.y * work_tile_dim + threadIdx.y;
    //blockIdx.z encodes both z and w; they aren't needed for the indexing below,
    //but this is how you would recover them:
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    (void)z_idx; (void)w_idx; (void)array_zz_w; //only shown for demonstration
    //offset of this (w, z) slice in the output: each slice holds y*x elements.
    size_t out_idx_zw = (size_t)blockIdx.z * array_zz_xx_y * array_xx_x;
    //zz is only 3D (w, z, y), so its per-slice offset is a different calculation.
    size_t array_zz_zw = (size_t)blockIdx.z * array_zz_xx_y;
    //this check ensures that if the launch dimensions don't exactly match the input
    //size we never access out-of-bounds memory. Branching can be bad, but not here:
    //99.999% of the time no divergence occurs and the whole warp follows the same
    //instruction pointer, so the cost is virtually zero.
    if(x_idx < array_xx_x){
        //each thread steps down the y axis so that, per warp, accesses along the
        //x dimension stay coalesced.
        for(unsigned i = 0; i < ilp_factor; ++i){
            unsigned y = y_idx + i * (work_tile_dim / ilp_factor);
            //need to also check y; these checks are virtually cost-less because
            //memory access dominates in such a simple calculation and the
            //arithmetic is hidden by overlapping execution.
            if(y < array_zz_xx_y){
                //splitting up the index calculation for simplicity's sake
                size_t out_array_idx = out_idx_zw + (size_t)y * array_xx_x + x_idx;
                size_t array_zz_idx  = array_zz_zw + y;
                size_t array_xx_idx  = (size_t)y * array_xx_x + x_idx;
                //actual final output
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
            }
        }
    }
}
You will have to make the launch dimensions something like:
work_tile_dim, ilp_factor = 32, 4          # must match the constants in the kernel
thread_dim = (work_tile_dim, work_tile_dim // ilp_factor, 1)  # (32, 8, 1)
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
# round up so partial tiles at the edges still get a block
block_dim = ((x_dim + work_tile_dim - 1) // work_tile_dim,
             (y_dim + work_tile_dim - 1) // work_tile_dim,
             wz_dim)
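For completeness, compiling and launching the kernel from Python would look roughly like this (untested sketch; kernel_source is the CUDA code above as a Python string, and zzg here is the untiled (w, z, y) complex array on the GPU):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule(kernel_source)
element_wise_multiplication = mod.get_function("element_wise_multiplication")

# the output has the fully broadcast shape (w, z, y, x)
out_gpu = gpuarray.empty((zz0.shape[0], zz0.shape[1], y_dim, x_dim), np.complex128)

element_wise_multiplication(
    zzg, xxg, out_gpu,                                # GPUArrays are passed by their device pointer
    np.uint32(zz0.shape[0]), np.uint32(zz0.shape[1]), # array_zz_w, array_zz_z
    np.uint32(y_dim), np.uint32(x_dim),               # array_zz_xx_y, array_xx_x
    block=thread_dim, grid=block_dim)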
And there are several further optimizations you may be able to take advantage of:
- store the global memory accesses for each work tile in shared memory inside the kernel. This ensures that the accesses along zz0's "y" axis (which is really its contiguous, x-like dimension) are coalesced when they are loaded into shared memory, and they are then read back out of shared memory (where coalescing doesn't matter, but bank conflicts do). See here on how to deal with that kind of bank conflict.
- instead of calculating Euler's formula on the host and expanding a double into a complex double, do the expansion inside the kernel itself: use sincos(-x, &out_sin, &out_cos) to achieve the same result while using far less memory bandwidth (see here). A rough sketch follows below.
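Here is a rough, untested sketch of that second point (assuming it lives in the same SourceModule as the kernel above, so pycuda-complex.hpp is already included; the function name and the array_zz_real input are mine, purely for illustration):

__device__ __forceinline__
pycuda::complex<double> exp_minus_ix(double x){
    //exp(-i*x) = cos(-x) + i*sin(-x); one sincos call produces both parts
    double out_sin, out_cos;
    sincos(-x, &out_sin, &out_cos);
    return pycuda::complex<double>(out_cos, out_sin);
}

//inside the multiplication kernel you would then read a plain double (half the
//bytes of a complex double) from global memory and expand it on the fly, e.g.:
//  out_array[out_array_idx] = exp_minus_ix(array_zz_real[array_zz_idx]) * array_xx[array_xx_idx];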
But note, even doing all of this will likely not give you the performance you want (though it will still likely be faster) unless you are on a higher-end GPU with full-rate double precision units, which most GPUs don't have (on most consumer cards double precision runs at a small fraction of single precision throughput). Double precision floating point units take up a lot of die space, and since GPUs are primarily built for graphics, they don't have much use for double precision. If you want higher precision than single precision floating point, but want to take advantage of the single precision hardware without the 1/8 to 1/32 throughput hit of double, you can use the techniques used in this answer to achieve this on the GPU, getting you closer to 1/2 to 1/3 throughput.