
I'm trying to speed up the following calculation but have not been able to reach the desired speed. I'm sure the issue is with my code and not a physical limitation of the GPU.

I have a matrix V that is 10,000 x 6 x 6 and another matrix P that is 6 x 1,000. Both are complex.

I need to compute V * P (which should result in 10,000 x 6 x 1,000), take the magnitude (or magnitude squared) of it, and then sum over the 6 dimension, resulting in a 10,000 x 1,000 matrix of real values.

I have tried the following:

af::array V{ 10000, 6, 6, c32 };
af::array P{ 6, 1000, c32 };
af::array VP = af::matmul(V, P); // results in 10,000 x 1,000 x 6 - OK, as long as I still sum over the 6 dim
af::array res = af::sum(af::abs(VP), 2);

This was not nearly fast enough. Then I tried converting V into an array of matrices, so I had:

af::array V[6] = { af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 },
                   af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 },
                   af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 } };
af::array VP[6];
af::array res;
for (int i = 0; i < 6; i++)
{
    VP[i] = af::matmul(V[i], P);
}
res = af::abs(VP[0]);

for (int i = 1; i < 6; i++)
{
    res += af::abs(VP[i]);
}

This had about a 2x speedup. I came up with another solution, but the af::matmul overload that takes in 3 arrays doesn't support options (like hermitian transpose) and doesn't support gfor, so I couldn't try that route.

Currently, the matrix multiply (in both approaches) takes about 2.2 ms, and it looks like ArrayFire can combine the abs and sum into one JIT kernel that takes about 2 ms.
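
For reference, I time each stage roughly like this (a sketch; the eval/sync calls are needed because ArrayFire executes lazily and asynchronously):

#include <arrayfire.h>
#include <cstdio>

// V and P declared as above
af::array VP = af::matmul(V, P);
VP.eval(); af::sync();                  // make sure the matmul itself has finished

af::timer t = af::timer::start();
af::array res = af::sum(af::abs(VP), 2);
res.eval(); af::sync();                 // force the fused abs+sum JIT kernel to run
printf("abs+sum: %g ms\n", af::timer::stop(t) * 1e3);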

My knowledge of ArrayFire is limited, so I'm guessing there is something I'm not thinking of. Does anyone have an idea of how I can increase the speed of this algorithm?

Thank you!

AAG
  • Hi, I am Pradeep, a dev from the ArrayFire core team. I have some queries. 1) You have a matrix that is 10k x 6 and another that is 6 x 1. 2) You want to do a matrix multiplication of these two to get a 10k x 1 matrix. However, there are 6 x 10k such operations. Is that right? – pradeep Jan 07 '20 at 10:56
  • 10k x 6 x 6 and 6 x 1k, which results in 10k x 1k x 6 (with ArrayFire), or I can do 6 matrix multiplies, each of which is [10k x 6] * [6 x 1k]. – AAG Jan 07 '20 at 12:53
  • I think I understand what you are trying to do now. Let me get back to you after running the code to check runtimes. What GPU do you have? – pradeep Jan 08 '20 at 14:36
  • NVIDIA GeForce GTX 1070 – AAG Jan 08 '20 at 14:52

1 Answer


I can confirm your finding that the looped version is about twice as fast as the batched matmul. The matmul is not actually what takes the long runtime in your code snippet; it is the other operation, summing along the third dimension after abs, that is costly. This is due to the following reasons.

1) sum(abs(result)) - abs is again not the issue here. Sum is a reduction algorithm, and reductions are usually quite fast along the fast-moving (first) dimension. However, when reducing along a higher dimension, the stride between successive elements is the size of an entire matrix. That is expensive compared to a reduction over contiguous locations.

2) looped abs additions - This version, however, accesses elements that are contiguous in memory, because we are basically adding the corresponding elements of 6 matrices. On top of this, the entire loop (along with the abs op) is converted into a single JIT kernel that does the following, which is very efficient:

res[i] = abs(ptr0[i]) + abs(ptr1[i]) + abs(ptr2[i]) + abs(ptr3[i]) + abs(ptr4[i]) + abs(ptr5[i])

The above line is just an illustration; it is not the exact JIT kernel.
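
In ArrayFire code, that means the loop in your second version only builds up a lazy expression graph; a single fused element-wise kernel runs when the result is evaluated. A sketch:

af::array res = af::abs(VP[0]);
for (int i = 1; i < 6; i++)
    res += af::abs(VP[i]); // no kernel launched here; the JIT graph keeps growing
res.eval();                // one fused kernel computes all six abs ops and five additions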

Hence, the batched version is slower than the looped version in this specific case, because of the reduction operation that is performed on the result of the matmul.
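
If you want to isolate the stride effect, one (untested) variation is to reorder the batched result so the reduction runs along the first, contiguous dimension - although the reorder itself performs a full copy, so it may not be a net win:

af::array VP  = af::matmul(V, P);         // 10,000 x 1,000 x 6
af::array t   = af::reorder(VP, 2, 0, 1); // 6 x 10,000 x 1,000
af::array res = af::moddims(af::sum(af::abs(t), 0), 10000, 1000);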

My test GPU: GTX 1060

The matmul itself for a single [10k x 6] * [6 x 1k] takes about half a millisecond on the GTX 1060, so I wouldn't think six such matmuls can be done in under a millisecond, at least on my GTX 1060. What is your target runtime?

EDIT (Jan 10, 2020): Actually, the suggestion below won't work, because of the abs operation on the result of each matmul.

You can try looking into our latest addition to the gemm category in the master branch of ArrayFire. However, you will have to build ArrayFire from source until our next feature release, 3.7. You can look at the documentation on the following page:

https://github.com/arrayfire/arrayfire/blob/master/include/af/blas.h#L230

It follows the principle of the C array from the cuBLAS gemm API: gemm computes C = alpha * op(A) * op(B) + beta * C, so with a nonzero beta you can accumulate successive products into the same output array.
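
For illustration only, here is roughly what that accumulation principle looks like with raw cuBLAS; with beta = 1, each gemm call adds its product into the existing contents of C. (As the edit above says, this does not solve the original problem, because abs has to be applied between each matmul and the sum.)

#include <cublas_v2.h>
#include <cuComplex.h>

// Sketch: accumulate six [10k x 6] * [6 x 1k] products into one C.
// All arrays are column major, device pointers.
void accumulate_six(cublasHandle_t h,
                    const cuComplex* V[6], // six 10,000 x 6 matrices
                    const cuComplex* P,    // 6 x 1,000
                    cuComplex* C)          // 10,000 x 1,000 accumulator
{
    const cuComplex one  = make_cuComplex(1.f, 0.f);
    const cuComplex zero = make_cuComplex(0.f, 0.f);
    for (int i = 0; i < 6; ++i) {
        // First call overwrites C (beta = 0); later calls accumulate (beta = 1).
        const cuComplex* beta = (i == 0) ? &zero : &one;
        cublasCgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    10000, 1000, 6,
                    &one, V[i], 10000, P, 6, beta, C, 10000);
    }
}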

pradeep
  • My target runtime is AS_FAST_AS_POSSIBLE ms :-) The fact that simple changes like looping vs. batching have such a large effect makes me think there is still a lot of room for speedup. I'm currently looking into writing my own modified matrix-multiplication kernel that loops over the 6 matrices, does the mag-sqr, and sums them up, all in one kernel, so I'm not loading from global memory more than I need to (a rough sketch of that idea appears after these comments). My current problem is that the block size I'm using for shared memory is 6 x 6, which doesn't help that much (due to the dimensions of the matrices). – AAG Jan 09 '20 at 16:29
  • Yes, your dimensions are part of the performance issue too, I think. One dimension that is 1,000 times longer than the other can cause issues. ArrayFire's matrix multiplication uses the CUDA toolkit's cuBLAS library, which is fine-tuned for different sets of sizes and batch modes. Therefore, I personally wouldn't write a custom kernel for matrix multiplication - the chances of getting it faster than that are slim to none. Is the size of your matrix mul op fixed at `[10k x 6] * [6 x 1k]`? – pradeep Jan 09 '20 at 20:28
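
For completeness, here is a rough sketch of the fused kernel AAG describes in the comment above: one thread per output element, looping over the 6 batch matrices and the inner dimension of 6, applying magnitude-squared and summing, so the intermediate VP never touches global memory. The column-major layout and all names here are assumptions, and the shared-memory tiling of P that AAG mentions is omitted for clarity.

#include <cuComplex.h>

__global__ void fused_magsq_sum(const cuComplex* __restrict__ V, // 10,000 x 6 x 6
                                const cuComplex* __restrict__ P, // 6 x 1,000
                                float* __restrict__ res,         // 10,000 x 1,000
                                int M, int N)                    // M = 10,000, N = 1,000
{
    int m = blockIdx.x * blockDim.x + threadIdx.x; // row index into V and res
    int n = blockIdx.y * blockDim.y + threadIdx.y; // column index into P and res
    if (m >= M || n >= N) return;

    float acc = 0.f;
    for (int b = 0; b < 6; ++b) {            // the 6 batch matrices
        cuComplex s = make_cuComplex(0.f, 0.f);
        for (int k = 0; k < 6; ++k)          // inner dimension of the matmul
            s = cuCaddf(s, cuCmulf(V[m + k * M + b * M * 6], P[k + n * 6]));
        acc += s.x * s.x + s.y * s.y;        // magnitude squared
    }
    res[m + n * M] = acc;                    // one global write per output element
}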