The first issue is that your timing metrics are not accurate.
linalg compiles CUDA modules on the fly, so the first call pays the cost of that compilation while you are timing it. I made some slight modifications to your code to reduce the size of the arrays being multiplied, but regardless, after two runs with no other changes I saw a massive gain in performance, e.g.:
Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633
vs
Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
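One way to take that compilation out of the measurement entirely is a warm-up call before you start the clock. A minimal, untested sketch (it assumes import pycuda.autoinit set up the context and that zzg and xxg are already GPUArrays):

# warm-up: the first multiply triggers the on-the-fly compilation
_ = linalg.multiply(zzg, xxg)
pycuda.autoinit.context.synchronize()   # wait for the GPU before touching the clock

t1 = time.time()
zz_Hypg = linalg.multiply(zzg, xxg)
pycuda.autoinit.context.synchronize()   # and again, so t2 covers the whole kernel
t2 = time.time()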
However, that is still much slower than the CPU version. The next thing I did was time only where the actual computation happens. You aren't tiling in your numpy version, so don't time the tiling in your CUDA version:
REAL Time taken with GPU:0.6461708545684814
You also copy to the GPU and include that in the measurement, but the copy itself takes a non-trivial amount of time, so let's remove that too:
t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992
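So the timed region ends up containing just the multiply, with the tiling, the copies, and the warm-up done beforehand. Roughly (untested; tile_reps is just a placeholder for whatever repetitions you were tiling with):

zzg = gpuarray.to_gpu(np.tile(zz0, tile_reps))   # tiling + copy happen before the clock starts
xxg = gpuarray.to_gpu(xx0)
_ = linalg.multiply(zzg, xxg)                    # warm-up, as above

t1 = time.time()
zz_Hypg = linalg.multiply(zzg, xxg)              # only the element-wise multiply is timed
pycuda.autoinit.context.synchronize()
t2 = time.time()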
Wow, removing the copy contributed a lot. But we're still slower than the numpy version. Why?
Remember when I said that numpy doesn't tile? It doesn't copy memory at all for broadcasting (see the small example after the list below). To get the real speed, you would have to:
- not tile
- broadcast dimensions
- implement this in a kernel.
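As a small aside, you can see that numpy really doesn't copy anything when broadcasting by looking at the strides of a broadcast view (nothing here is specific to your code):

import numpy as np

a = np.arange(4.0).reshape(4, 1)        # shape (4, 1)
b = np.broadcast_to(a, (4, 1000000))    # shape (4, 1000000)
print(b.strides)                        # (8, 0): the broadcast axis has stride 0,
                                        # so no new memory was allocated

An element-wise multiply between arrays with compatible shapes does the same thing internally, which is why there is no tiling cost to subtract on the numpy side.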
PyCUDA provides the utilities for implementing your own kernels, but its GPUArray does not provide broadcasting. Essentially what you would have to do is something like this (DISCLAIMER: I haven't tested this, there are probably bugs, this is just to demonstrate approximately what the kernel should look like):
#include <pycuda-complex.hpp>
//KERNEL CODE

//edge length of the square output tile each thread block covers
constexpr unsigned work_tile_dim = 32;
//instruction level parallelism factor: how many output rows each thread handles.
//It may be changed, but it affects the launch dimensions; the thread block size
//should be (work_tile_dim, work_tile_dim / ilp_factor).
constexpr unsigned ilp_factor = 4;

//assuming C order:
//  x axis contiguous in out,
//  y axis contiguous in zz,
//  x axis contiguous in xx
//using __restrict__ because we know all pointers refer to different parts of memory.
__global__
void element_wise_multiplication(
        const pycuda::complex<double>* __restrict__ array_zz,
        const pycuda::complex<double>* __restrict__ array_xx,
        pycuda::complex<double>* __restrict__ out_array,
        unsigned array_zz_w,    /*size of the w dimension of zz (not needed below, kept for clarity)*/
        unsigned array_zz_z,    /*size of the z dimension of zz*/
        unsigned array_zz_xx_y, /*size of the y dimension, shared by zz and xx*/
        unsigned array_xx_x){   /*size of the x dimension of xx*/
    //the z dimension of a grid often has fairly small size limits and can cause
    //performance issues on older cards, so we derive the x and y indices from the
    //x and y block/thread indices and fold both z and w into blockIdx.z.
    unsigned x_idx = blockIdx.x * work_tile_dim + threadIdx.x;
    unsigned y_idx = blockIdx.y * work_tile_dim + threadIdx.y;
    //blockIdx.z encodes both z and w; they aren't needed for the indexing below,
    //but this is how you would recover them:
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    (void)z_idx; (void)w_idx; (void)array_zz_w; //only shown for demonstration
    //offset of this (w, z) slice in the output: each slice holds y*x elements.
    size_t out_idx_zw = (size_t)blockIdx.z * array_zz_xx_y * array_xx_x;
    //zz is only 3D (w, z, y), so its per-slice offset is a different calculation.
    size_t array_zz_zw = (size_t)blockIdx.z * array_zz_xx_y;
    //this check ensures that if the launch dimensions don't exactly match the input
    //size we never access out-of-bounds memory. Branching can be bad, but not here:
    //99.999% of the time no divergence occurs and the whole warp follows the same
    //instruction pointer, so the cost is virtually zero.
    if(x_idx < array_xx_x){
        //each thread steps down the y axis so that, per warp, accesses along the
        //x dimension stay coalesced.
        for(unsigned i = 0; i < ilp_factor; ++i){
            unsigned y = y_idx + i * (work_tile_dim / ilp_factor);
            //need to also check y; these checks are virtually cost-less because
            //memory access dominates in such a simple calculation and the
            //arithmetic is hidden by overlapping execution.
            if(y < array_zz_xx_y){
                //splitting up the index calculation for simplicity's sake
                size_t out_array_idx = out_idx_zw + (size_t)y * array_xx_x + x_idx;
                size_t array_zz_idx  = array_zz_zw + y;
                size_t array_xx_idx  = (size_t)y * array_xx_x + x_idx;
                //actual final output
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
            }
        }
    }
}
You will have to make the launch dimensions something like:
work_tile_dim, ilp_factor = 32, 4          # must match the constants in the kernel
thread_dim = (work_tile_dim, work_tile_dim // ilp_factor, 1)  # (32, 8, 1)
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
# round up so partial tiles at the edges still get a block
block_dim = ((x_dim + work_tile_dim - 1) // work_tile_dim,
             (y_dim + work_tile_dim - 1) // work_tile_dim,
             wz_dim)
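For completeness, compiling and launching the kernel from Python would look roughly like this (untested sketch; kernel_source is the CUDA code above as a Python string, and zzg here is the untiled (w, z, y) complex array on the GPU):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule(kernel_source)
element_wise_multiplication = mod.get_function("element_wise_multiplication")

# the output has the fully broadcast shape (w, z, y, x)
out_gpu = gpuarray.empty((zz0.shape[0], zz0.shape[1], y_dim, x_dim), np.complex128)

element_wise_multiplication(
    zzg, xxg, out_gpu,                                # GPUArrays are passed by their device pointer
    np.uint32(zz0.shape[0]), np.uint32(zz0.shape[1]), # array_zz_w, array_zz_z
    np.uint32(y_dim), np.uint32(x_dim),               # array_zz_xx_y, array_xx_x
    block=thread_dim, grid=block_dim)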
And there are several further optimizations you may be able to take advantage of:
- store the global memory accesses for each work tile in shared memory inside the kernel. This ensures that the accesses along zz0's "y" axis (which is really its contiguous, x-like dimension) are coalesced when they are loaded into shared memory, and they are then read back out of shared memory (where coalescing doesn't matter, but bank conflicts do). See here on how to deal with that kind of bank conflict.
- instead of calculating Euler's formula on the host and expanding a double into a complex double, do the expansion inside the kernel itself: use sincos(-x, &out_sin, &out_cos) to achieve the same result while using far less memory bandwidth (see here). A rough sketch follows below.
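Here is a rough, untested sketch of that second point (assuming it lives in the same SourceModule as the kernel above, so pycuda-complex.hpp is already included; the function name and the array_zz_real input are mine, purely for illustration):

__device__ __forceinline__
pycuda::complex<double> exp_minus_ix(double x){
    //exp(-i*x) = cos(-x) + i*sin(-x); one sincos call produces both parts
    double out_sin, out_cos;
    sincos(-x, &out_sin, &out_cos);
    return pycuda::complex<double>(out_cos, out_sin);
}

//inside the multiplication kernel you would then read a plain double (half the
//bytes of a complex double) from global memory and expand it on the fly, e.g.:
//  out_array[out_array_idx] = exp_minus_ix(array_zz_real[array_zz_idx]) * array_xx[array_xx_idx];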
But note, even doing all of this will likely not give you the performance you want (though it will still likely be faster) unless you are on a higher-end GPU with full-rate double precision units, which most GPUs don't have (on most consumer cards double precision runs at a small fraction of single precision throughput). Double precision floating point units take up a lot of die space, and since GPUs are primarily built for graphics, they don't have much use for double precision. If you want higher precision than single precision floating point, but want to take advantage of the single precision hardware without the 1/8 to 1/32 throughput hit of double, you can use the techniques used in this answer to achieve this on the GPU, getting you closer to 1/2 to 1/3 throughput.