Vendor-provided LAPACK/BLAS libraries (Intel's IPP/MKL have been mentioned, but there is also AMD's ACML, and other CPU vendors such as IBM/POWER or Oracle/SPARC provide equivalents as well) are often highly tuned for specific CPU features and can significantly boost performance on large data sets.
Often, though, you have very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. the operations used in 3D geometry processing), and for that sort of thing BLAS/LAPACK are overkill, because these subroutines first run tests on the properties of the data set to decide which code path to take. In those situations, simple C/C++ source code, perhaps using SSE2...SSE4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
That's why, for example, Intel provides two libraries - MKL for large linear algebra data sets, and IPP for small (graphics-vector) data sets.
In that sense:

- What exactly is your data set?
- What matrix/vector sizes?
- What linear algebra operations?
Also, regarding "simple for loops": give the compiler the chance to vectorize for you. E.g. something like:
/* note: assumes DIM_OF_MY_VECTOR is a multiple of 4 */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
    vecmul[i]   = src1[i]   * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];
might be a better feed for a vectorizing compiler than the plain

for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i] * src2[i];

expression. So what exactly you mean by "calculations with for loops" will have a significant impact.
If your vector dimensions are large enough, though, the BLAS version,

dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);

will be cleaner code and likely faster.
On the reference side, these might be of interest: