
Nvidia, for example, has CUBLAS, which promises a 7-14x speedup. Naively, this is nowhere near the theoretical peak throughput of any of Nvidia's GPU cards. What are the challenges in speeding up linear algebra on GPUs, and are there faster linear algebra routines already available?

Jiahao Chen
  • I don't understand why there have been votes to close. I am seeking feedback from users of GPU-accelerated libraries, possibly links to benchmarking studies, or other such information. – Jiahao Chen Sep 04 '12 at 20:48

1 Answer


As far as I know, CUBLAS is the fastest linear algebra implementation available for Nvidia GPUs. If you require LAPACK functionality, there's CULA.
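To make the API concrete, here's a minimal sketch of a single-precision matrix multiply (SGEMM) with CUBLAS; the matrix size and fill values are arbitrary test data, and error checking is omitted for brevity:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 1024;                      /* arbitrary square size */
        const float alpha = 1.0f, beta = 0.0f;
        size_t bytes = (size_t)n * n * sizeof(float);

        /* Host matrices filled with trivial test data */
        float *hA = (float*)malloc(bytes);
        float *hB = (float*)malloc(bytes);
        float *hC = (float*)malloc(bytes);
        for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        /* Device copies */
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        /* C = alpha*A*B + beta*C; cuBLAS uses column-major (Fortran) layout */
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0] = %.1f (expected %.1f)\n", hC[0], 2.0f * n);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }

Note that CUBLAS follows the Fortran BLAS convention of column-major storage, so row-major C arrays are effectively transposed unless you account for the layout in the transpose arguments.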

Note that CUBLAS only covers dense linear algebra; for sparse matrices, there's CUSPARSE (also provided as part of the CUDA toolkit).
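For the CUSPARSE side, here's a rough sketch of a sparse matrix-vector product y = A*x using the generic SpMV API (introduced around CUDA 10.1; older toolkits used per-format routines such as csrmv instead, and the algorithm enum name has varied between toolkit versions). The small CSR matrix is made-up example data, and error checking is again omitted:

    #include <cusparse.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        /* CSR representation of a small 4x4 sparse matrix (example data) */
        const int rows = 4, cols = 4, nnz = 5;
        int   hRowPtr[] = {0, 1, 2, 4, 5};
        int   hColInd[] = {0, 1, 0, 2, 3};
        float hVals[]   = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
        float hX[]      = {1.0f, 1.0f, 1.0f, 1.0f};
        float hY[4]     = {0};
        const float alpha = 1.0f, beta = 0.0f;

        int *dRowPtr, *dColInd;
        float *dVals, *dX, *dY;
        cudaMalloc((void**)&dRowPtr, (rows + 1) * sizeof(int));
        cudaMalloc((void**)&dColInd, nnz * sizeof(int));
        cudaMalloc((void**)&dVals,   nnz * sizeof(float));
        cudaMalloc((void**)&dX,      cols * sizeof(float));
        cudaMalloc((void**)&dY,      rows * sizeof(float));
        cudaMemcpy(dRowPtr, hRowPtr, (rows + 1) * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dColInd, hColInd, nnz * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dVals, hVals, nnz * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dX, hX, cols * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dY, hY, rows * sizeof(float), cudaMemcpyHostToDevice);

        cusparseHandle_t handle;
        cusparseCreate(&handle);

        /* Wrap the raw device arrays in matrix/vector descriptors */
        cusparseSpMatDescr_t matA;
        cusparseDnVecDescr_t vecX, vecY;
        cusparseCreateCsr(&matA, rows, cols, nnz, dRowPtr, dColInd, dVals,
                          CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                          CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
        cusparseCreateDnVec(&vecX, cols, dX, CUDA_R_32F);
        cusparseCreateDnVec(&vecY, rows, dY, CUDA_R_32F);

        /* Query workspace size, then run y = alpha*A*x + beta*y */
        size_t bufSize = 0;
        void *dBuf = NULL;
        cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                                matA, vecX, &beta, vecY, CUDA_R_32F,
                                CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
        cudaMalloc(&dBuf, bufSize);
        cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                     matA, vecX, &beta, vecY, CUDA_R_32F,
                     CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

        cudaMemcpy(hY, dY, rows * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < rows; ++i) printf("y[%d] = %f\n", i, hY[i]);

        cusparseDestroySpMat(matA);
        cusparseDestroyDnVec(vecX);
        cusparseDestroyDnVec(vecY);
        cusparseDestroy(handle);
        cudaFree(dRowPtr); cudaFree(dColInd); cudaFree(dVals);
        cudaFree(dX); cudaFree(dY); cudaFree(dBuf);
        return 0;
    }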

The speedup depends heavily on the type of data you're operating on and on the specific operation you're performing. Some linear algebra operations parallelize very well, while others don't because they're inherently sequential; dense matrix-matrix multiplication, for instance, parallelizes almost perfectly, whereas a triangular solve carries a dependency chain from one row to the next. Optimization of numerical algorithms for parallel architectures is (and has been, for decades) an active area of research, so the performance of these algorithms continues to improve.

Jack P.