It's not uncommon for GPUs to be less interesting on small data sets than on large data sets. The reasons for this vary with the specific algorithm. GPUs generally have higher main memory bandwidth than CPUs and can usually outperform them for heavy-duty number crunching. But GPUs only work well when the problem has inherent parallelism that can be exposed. Taking advantage of this parallelism is what allows an algorithm to tap into the greater memory bandwidth as well as the higher compute capability.
However, before the GPU can do anything, it's necessary to get the data to the GPU. And this creates a "cost" to the GPU version of the code that will not normally be present in the CPU version.
To be more precise, the GPU provides a benefit when the reduction in computation time on the GPU (over the CPU) exceeds the cost of the data transfer. I believe solving a system of linear equations is somewhere between O(n^2) and O(n^3) in complexity. For very small n, the computational work may not be large enough to offset the cost of the data transfer, but clearly as n becomes larger it should be. Your vector operation, on the other hand, may only be O(n) in complexity, so the benefit scenario will look different.
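A rough way to see where that crossover sits for your particular case is to time the transfer and the kernel separately with CUDA events and compare the kernel time against your CPU time plus the copy time. The sketch below is a minimal example for an O(n) vector operation; the kernel `vec_op`, the problem size, and the launch configuration are made up for illustration, not taken from your code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in O(n) vector operation; the real workload would go here.
__global__ void vec_op(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;                  // hypothetical problem size
    const size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // the transfer "cost"
    cudaEventRecord(t1);
    vec_op<<<(n + 255) / 256, 256>>>(d_x, d_y, n);         // the compute work
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float ms_copy, ms_kernel;
    cudaEventElapsedTime(&ms_copy, t0, t1);
    cudaEventElapsedTime(&ms_kernel, t1, t2);
    printf("copy: %.3f ms, kernel: %.3f ms\n", ms_copy, ms_kernel);

    // The GPU only wins overall if its speedup over the CPU version exceeds
    // the copy time (plus any copy of results back to the host).
    cudaFree(d_x); cudaFree(d_y); free(h_x);
    return 0;
}
```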
For the O(n^2) or O(n^3) case, as we move to larger data sets, the "cost" to transfer the data increases as O(n), but the compute requirements for the solution increase as O(n^2) (or O(n^3)). Therefore larger data sets have compute workloads that grow much faster than the transfer cost, reducing the relative effect of the "cost" of the data transfer; for example, doubling n roughly doubles the transfer cost but increases an O(n^3) workload by about a factor of eight. An O(n) problem, on the other hand, won't have this scaling dynamic: the workload increases at the same rate as the "cost" of data transfer.
Also note that if the "cost" of transferring data to the GPU can be hidden by overlapping it with computation work, then the "cost" for the overlapped portion becomes "free", i.e. it does not contribute to the overall solution time.
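A common way to get that overlap is to break the work into chunks and issue each chunk's copies and kernel in its own CUDA stream, using pinned host memory so the asynchronous copies can actually overlap with kernels in other streams. The sketch below is only meant to show the pattern; the `vec_op` kernel, the chunk count, and the evenly divisible problem size are assumptions, not details from your code.

```cuda
#include <cuda_runtime.h>

// Same stand-in O(n) kernel as before.
__global__ void vec_op(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i] + 1.0f;
}

int main()
{
    const int n = 1 << 22, nchunks = 4, chunk = n / nchunks;  // n assumed divisible
    const size_t cbytes = chunk * sizeof(float);

    float *h_x, *h_y, *d_x, *d_y;
    cudaHostAlloc((void **)&h_x, n * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaHostAlloc((void **)&h_y, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    cudaStream_t s[nchunks];
    for (int c = 0; c < nchunks; c++) cudaStreamCreate(&s[c]);

    // Each chunk's H2D copy can overlap with another chunk's kernel or D2H copy,
    // so much of the transfer time is hidden behind computation.
    for (int c = 0; c < nchunks; c++) {
        int off = c * chunk;
        cudaMemcpyAsync(d_x + off, h_x + off, cbytes, cudaMemcpyHostToDevice, s[c]);
        vec_op<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d_x + off, d_y + off, chunk);
        cudaMemcpyAsync(h_y + off, d_y + off, cbytes, cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < nchunks; c++) cudaStreamDestroy(s[c]);
    cudaFree(d_x); cudaFree(d_y); cudaFreeHost(h_x); cudaFreeHost(h_y);
    return 0;
}
```

How much of the transfer you can actually hide depends on the device (number of copy engines) and on the kernel time per chunk being comparable to the copy time per chunk; a profiler timeline will show whether the copies and kernels really overlap.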