I'm trying to speed up the following calculations but have not been able to reach the desired speed. Im sure the issue is with my code and not physical limitations of the GPU.
I have a matrix V that is 10,000 x 6 x 6. And another matrix P that is 6 x 1,000
Both complex
I need to do V * P (which should results in 10,000 x 6 x 1000) Take the magnitude (or mag sq) of it and then sum in the 6 dimension. resulting in a 10,000 x 1000 of real values.
I have tried the following:
af::array V{ 10000, 6, 6, c32 };
af::array P{ 6, 1000, c32 };
af::array VP = af::matmul(V, P); (results in 10,000x1000x6 - ok, as long as i still sum in the 6 dim)
af::array res = af::sum(af::abs(VP),2);
This was not nealy fast enough. Then I tried converting V into an array, so I had:
af::array V[6] = { af::array{ 10000, 6, c32 },
af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 }, af::array{
10000, 6, c32 }, af::array{ 10000, 6, c32 }, af::array{
10000, 6, c32 } };
af::array VP[6];
af::array res;
for (int i = 0; i < 6; i++)
{
VP[i] = af::matmul(V[i], P);
}
res= af::abs(mCalledData[0]);
for (int i = 1; i < 6; i++)
{
res+= af::abs(VP[i]);
}
This had about a 2x speedup. I came up with another solution but af::matmult that takes in 3 arrays doesn't support options (like hermitian) and doesn't support gfor, so I couldn't try that route.
Currently, the matrix multiply (in both approaches) takes about 2.2ms and it looks like arrayfire can combine the abs and sum into one JIT kernel that takes about 2 ms.
My knowledge of arrayfire is limited so i'm guessing there is something I'm not thinking of. Does anyone have an idea of how I can increase the speed of this algorithm?
Thank you!