I ran the following test in MATLAB:
n = 10000;
A = rand(n,n);
b = rand(n, 1);
tic
y = A\b;
toc
On my 5th-generation Intel i7 machine (12 cores), this takes about 5 seconds.
Then I tried to do the same using the CUDA 9.2 SDK sample code (see cuSolverDn_LinearSolver.cpp). Surprisingly, on my Nvidia GTX 970 it takes about 6.5 seconds to solve a problem of the same size!
What is wrong? Note that my matrix is symmetric and square, and b has only one column. Is there a better way to solve this problem using CUDA? Should I expect better performance from a newer GPU?