Vendor-provided LAPACK/BLAS libraries (Intel's IPP/MKL have been mentioned, but there is also AMD's ACML, and other CPU vendors such as IBM/POWER or Oracle/SPARC provide equivalents as well) are often highly tuned for specific CPU features and can significantly boost performance on large data sets.
Often, though, you have very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. the operations used in 3D geometry processing), and for that sort of thing BLAS/LAPACK are overkill, because these subroutines first run tests on the properties of the data set to decide which code path to take. In those situations, simple C/C++ source code, perhaps using SSE2...SSE4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
That's why, for example, Intel provides two libraries - MKL for large linear algebra data sets, and IPP for small (graphics-vector) data sets.
In that sense:

- What exactly is your data set?
- What matrix/vector sizes?
- What linear algebra operations?
Also, regarding "simple for loops": give the compiler the chance to vectorize for you. E.g. something like:
/* note: assumes DIM_OF_MY_VECTOR is a multiple of 4 */
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
    vecmul[i]   = src1[i]   * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];
might be a better feed for a vectorizing compiler than the plain

for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i] * src2[i];

expression. So what exactly you mean by "calculations with for loops" will have a significant impact.
If your vector dimensions are large enough, though, the BLAS version,

dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);

will be cleaner code and likely faster.
On the reference side, these might be of interest: