
I've run the following test in MATLAB:

n = 10000;
A = rand(n,n);
b = rand(n, 1);

tic
y = A\b;
toc

On my 5th-gen Intel i7 machine (12 cores), this takes ~5 seconds.

Then I tried to do the same using the CUDA 9.2 SDK sample code (see cuSolverDn_LinearSolver.cpp). Surprisingly, on my NVIDIA GTX 970 it takes ~6.5 seconds to solve the same problem size as above!

What is wrong? Note that my matrix is symmetric and square, and b has only one column. Is there a better way to solve this problem using CUDA? Should I expect better performance from a newer GPU?

  • How can `A = rand(n,n);` yield a symmetric matrix, as you claim in your question? – talonmies Dec 21 '18 at 17:50
  • Have you performed the experiment in single or double precision? A 12-core i7 is quite a beefy machine, by the way. – tera Dec 21 '18 at 18:02
  • Try using MATLAB's GPU functionality and see how fast that is. If it is faster than the CPU code then you can certainly do better; if not, the benefit is perhaps countered by the overhead. – Nicky Mattsson Dec 21 '18 at 18:48
  • In the C++ code I made it symmetric, not in MATLAB. I'm using doubles. – paduraru2009 Dec 21 '18 at 19:17

1 Answer


Here is the code I used to test this:

n = 10000;
A = rand(n,n,'single');
b = rand(n, 1,'single');

% CPU solve
tic
y = A\b;
toc

% Move the data to the GPU and solve again there
A = gpuArray(A);
b = gpuArray(b);

tic
y = A\b;
toc

Here are the results:

Elapsed time is 2.673490 seconds.
Elapsed time is 0.553348 seconds.
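
A side note, not part of the original test: gpuArray operations can execute asynchronously, so a plain tic/toc may under-report GPU time. A sketch of a more careful measurement, using standard Parallel Computing Toolbox calls:

tic
y = A\b;
wait(gpuDevice)   % block until all queued GPU work has finished
toc

% Alternatively, gputimeit handles warm-up and synchronization internally:
t = gputimeit(@() A\b);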

I am running on a 4-core 7700 laptop with a GTX 1060 GPU, so approximately the same computing power, I think. As you can see, in this case the GPU runs faster.

The most likely factor is precision. GPUs only have single precision multipliers while CPUs have double precision multipliers. If a GPU has to do a double-precision multiplication, it must gang together several of its multipliers for each operation, which drastically reduces throughput. If I change the variables to double precision, the times change as shown below.
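
Only the type flag passed to rand changes (shown here for reference; 'double' is also MATLAB's default):

A = rand(n,n,'double');
b = rand(n, 1,'double');

With doubles, the results are: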

Elapsed time is 5.784525 seconds.
Elapsed time is 5.398702 seconds.

While the GPU is still faster on my machine, the point stands: with double precision, the CPU and GPU times are much closer together.
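
Since the asker mentions the matrix is symmetric, one more avenue is worth sketching: if A is also positive definite (an assumption, not stated in the question), a Cholesky factorization does roughly half the floating-point work of the LU factorization behind the generic A\b. A minimal, untested sketch in MATLAB (chol errors out if A is not positive definite):

n = 10000;
A = rand(n,n,'single');
A = A*A' + n*eye(n,'single');   % build a symmetric positive definite test matrix
b = rand(n, 1,'single');
A = gpuArray(A);
b = gpuArray(b);

tic
R = chol(A);        % upper-triangular factor, ~n^3/3 flops vs ~2n^3/3 for LU
y = R \ (R' \ b);   % two triangular solves: R'*z = b, then R*y = z
wait(gpuDevice)
toc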

– Durkee
  • Interesting to see `double` being half as fast as `single` on CPU. That points at the bottleneck being data transfer from memory to CPU, because the CPU does single-float computations using the same representation as double floats (usually 10 bytes, long double, for modern CPUs). – Cris Luengo Dec 21 '18 at 18:53
  • I know, I was surprised by that too. The thing is, the data transfer should be a much smaller fraction since it's only 10000 values. Are we sure modern CPUs don't have single-precision cores, or are they all double precision? – Durkee Dec 21 '18 at 18:55
  • SSE/AVX vector units will compute twice as many floats per cycle as doubles. This is probably what is used at the optimized BLAS level behind these operations. – Peter Dec 21 '18 at 19:04
  • It is `10000 * 10000` values, that is more than fits in your cache. I am not 100% sure about the CPU not having a single-precision core. As @Peter mentions, there are SIMD instructions that logically can process twice as many single-precision floats as double-precision floats. But I'm not sure how much those are used here. – Cris Luengo Dec 21 '18 at 19:07
  • "GPUs only have single precision multipliers": Since the Fermi architecture (ca. 10 years ago) all NVIDIA GPUs have included double-precision arithmetic units. However, the throughput of double-precision operations is only 1/2 to 1/64 of the single-precision operations, with consumer GPUs (such as the asker's GPU) at the low end of the spectrum. – njuffa Dec 21 '18 at 19:09
  • MATLAB uses the Intel MKL for its BLAS (and LAPACK) implementation, which most certainly uses the highest-supported vector instruction set. – Peter Dec 21 '18 at 20:26
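
To illustrate the vector-unit point raised in these comments, here is a rough sketch (the exact ratio depends on the CPU and the BLAS build):

% Compare CPU matrix-multiply throughput in single vs double precision.
% With SSE/AVX BLAS kernels, single typically runs close to 2x the double rate.
n = 4000;
As = rand(n, 'single');
Ad = double(As);
tic; Bs = As*As; t1 = toc;
tic; Bd = Ad*Ad; t2 = toc;
fprintf('single: %.2fs  double: %.2fs  ratio: %.2f\n', t1, t2, t2/t1)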