7

Which algorithm would you recommend for the fast solution of a dense linear system of fixed dimension (N=9), where the matrix is symmetric and positive-semidefinite?

  • Gaussian elimination
  • LU decomposition
  • Cholesky decomposition
  • etc.?

The element types are 32- and 64-bit floating point.

Such systems will be solved millions of times, so the algorithm should be rather fast for this fixed small dimension (N=9).

P.S. Examples of robust C++ implementations of the proposed algorithm are appreciated.

1) What do you mean by "solved millions of times"? The same coefficient matrix with a million different right-hand sides, or a million distinct matrices?

A million distinct matrices.

2) Positive _semi_definite means that the matrix can be singular (to machine precision). How would you like to deal with that case? Just raise an error, or try to return some sensible answer?

Raising an error is OK.

qble
  • Your question might be a better fit on http://scicomp.stackexchange.com/ – biziclop Nov 13 '12 at 22:00
  • I would definitely suggest reposting the question on scicomp; you should, however, specify some additional info. 1) What do you mean by "solved millions of times"? The same coefficient matrix with a million different right-hand sides, or a million distinct matrices? 2) Positive _semi_definite means that the matrix can be singular (to machine precision). How would you like to deal with this case? Just raise an error, or try to return some sensible answer? – Stefano M Nov 13 '12 at 23:10

6 Answers

7

The matrix being symmetric positive-semidefinite, the Cholesky decomposition is strictly superior to the LU decomposition: it is roughly twice as fast as LU, whatever the size of the matrix (source: "Numerical Linear Algebra" by Trefethen and Bau).

It is also the de facto standard for small dense matrices (source: I am doing a PhD in computational mathematics). Iterative methods are less efficient than direct methods until the system becomes large enough. A quick rule of thumb that proves nothing, but is always nice to have: on any modern computer, any matrix smaller than 100x100 is definitely a small matrix that needs direct methods rather than iterative ones.

Now, I do not recommend doing it yourself. There are tons of good libraries out there that have been thoroughly tested. But if I have to recommend one, it would be Eigen:

  • No installation required (header-only library, so there is nothing to link, only #include<>)
  • Robust and efficient (they have a lot of benchmarks on the main page, and the results are nice)
  • Easy to use and well documented

By the way, here in the documentation you have the various pros and cons of their 7 direct linear solvers in a nice, concise table. It seems that in your case LDLT (a variation of Cholesky) wins.
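
For illustration, a minimal sketch of that recommendation with Eigen's fixed-size types (the random SPD test system and the error handling are my own assumptions, not part of the question):

#include <Eigen/Dense>
#include <iostream>

typedef Eigen::Matrix<double, 9, 9> Matrix9d;
typedef Eigen::Matrix<double, 9, 1> Vector9d;

int main()
{
    // Build a random symmetric positive-(semi)definite test matrix.
    Matrix9d M = Matrix9d::Random();
    Matrix9d A = M.transpose() * M;
    Vector9d b = Vector9d::Random();

    // Factor once with LDLT (no square roots), then solve.
    Eigen::LDLT<Matrix9d> ldlt(A);
    if (ldlt.info() != Eigen::Success)
        return 1;                       // "raising an error is OK"
    Vector9d x = ldlt.solve(b);

    std::cout << "residual: " << (A * x - b).norm() << "\n";
    return 0;
}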

B. Decoster
  • "But if I have to recommend you one, it would be Eigen" - Yes, I considered it - it is pretty cool, and has built-in fixed size vectors and matrices. But unfortunatly documentation says that it doesn't work on older compilers (like GCC 3.x, VS2005, etc). Though such limitation it is pretty reasonable for very high-performance libraries, but project I work on should support such older compilers. "LDLT (a variation of Cholesky) wins" - yes, I checked it, it doesn't require sqrt in compare to LLT. My main concern was - maybe LU is better for small matrices like 9x9? – qble Nov 14 '12 at 07:28
  • Ohhhhh damn, I never considered supporting old compilers... Well, I don't know a library that would suit your needs, but what I do know is that when the matrix is s.p.d., Cholesky is roughly twice as fast as LU (source: "Numerical Linear Algebra" by Trefethen and Bau). It's independent of the size. Perhaps you'll have to implement it yourself... – B. Decoster Nov 14 '12 at 07:35
  • "Perhaps you'll have to implement it yourself..." - I expected that :) That's why my main question was about the algorithm rather than about a library. – qble Nov 14 '12 at 07:42
  • @Fezvez - is the above also correct for larger and sparse matrices? (larger means n: ~700-~300, with ~1/16 of the matrix having non-zero values)? Thank you. – rkellerm Sep 09 '14 at 12:21
  • Sorry, I am a bit rusty in this aspect of linear algebra. I hope you will find your answer (though, the best way is to benchmark it yourself) – B. Decoster Sep 11 '14 at 06:52
6

Generally, one is best off using an existing library, rather than a roll-your-own approach, as there are many tedious details to attend to in pursuit of a fast, stable numerical implementation.

Here are a few to get you started:

Eigen library (my personal preference):
http://eigen.tuxfamily.org/dox/QuickRefPage.html#QuickRef_Headers

Armadillo: http://arma.sourceforge.net/

Search around and you'll find plenty of others.
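
As a taste of how little code this takes, here is a minimal Armadillo sketch (the random test system is my own example; as noted in the comments below, the dense factorizations are delegated to LAPACK):

#include <armadillo>

int main()
{
    arma::mat M(9, 9, arma::fill::randu);
    arma::mat A = M.t() * M;            // symmetric positive-(semi)definite
    arma::vec b(9, arma::fill::randu);
    arma::vec x = arma::solve(A, b);    // LAPACK-backed dense solve
    return x.n_elem == 9 ? 0 : 1;
}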

arr_sea
  • Although I agree that a library is in most cases the best choice, for such a smallish matrix the overhead associated with a full-fledged library could be excessive. – Stefano M Nov 13 '12 at 23:14
  • I already considered Eigen. Thanks for Armadillo point, I will check it. – qble Nov 14 '12 at 07:20
  • [Armadillo](http://arma.sourceforge.net) is a good choice as it uses the well-tested LAPACK for LU decomposition, Cholesky decomposition, etc. You can also easily change the underlying LAPACK library with highly optimized drop-in replacements, such as Intel's [MKL libraries](http://software.intel.com/en-us/intel-mkl/). – mtall Nov 14 '12 at 09:08
3

I would recommend LU decomposition, especially if "solved millions of times" really means "solved once and applied to millions of vectors". You'll create the LU decomposition, save it, and apply forward-back substitution against as many r.h.s. vectors as you wish.

It's more stable in the face of roundoff if you use pivoting.
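
A sketch of that factor-once, solve-many pattern with Eigen's PartialPivLU (illustrative only; as the comment below points out, the asker in fact has a different A each time):

#include <Eigen/Dense>

typedef Eigen::Matrix<double, 9, 9> Matrix9d;
typedef Eigen::Matrix<double, 9, 1> Vector9d;

int main()
{
    Matrix9d A = Matrix9d::Random();      // stand-in for the real coefficient matrix
    Eigen::PartialPivLU<Matrix9d> lu(A);  // factor once, with pivoting for stability
    for (int k = 0; k < 1000000; ++k) {
        Vector9d b = Vector9d::Random();  // stand-in right-hand sides
        Vector9d x = lu.solve(b);         // forward/back substitution only
        (void)x;                          // ... use x here
    }
    return 0;
}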

duffymo
  • No, sorry, I meant millions of different Ax=b systems (where A is different too). Anyway, +1 – qble Nov 14 '12 at 07:02
2

LU for a symmetric semi-definite matrix does not make much sense: you destroy a nice property of your input data by performing unnecessary operations.

The choice between LLT and LDLT really depends on the condition number of your matrices and on how you intend to treat edge cases. LDLT should be used only if you can prove a statistically significant improvement in accuracy, or if robustness is of paramount importance to your app.

(Without a sample of your matrices it is hard to give sound advice, but I suspect that with such a small order, N=9, pivoting the small diagonal terms toward the bottom part of D is really not necessary. So I would start with classical Cholesky and simply abort the factorization if the diagonal terms become too small with respect to some sensibly chosen tolerance.)

Cholesky is pretty simple to code, and if you strive for really fast code, it is better to implement it yourself.
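
A minimal sketch of that idea, assuming row-major storage and a caller-chosen tolerance (the function name and interface are made up for illustration):

#include <cmath>

// Classical Cholesky, A = L*L', for a 9x9 matrix stored row-major in a[81].
// The lower triangle is overwritten with L. Returns false (the "raise an
// error" case) as soon as a pivot drops below tol, which is how singular or
// near-singular semi-definite input is detected.
static bool cholesky9(double* a, double tol)
{
    const int n = 9;
    for (int j = 0; j < n; ++j) {
        double d = a[j*n + j];
        for (int k = 0; k < j; ++k)
            d -= a[j*n + k] * a[j*n + k];
        if (d < tol)
            return false;
        const double ljj = std::sqrt(d);
        a[j*n + j] = ljj;
        for (int i = j + 1; i < n; ++i) {
            double s = a[i*n + j];
            for (int k = 0; k < j; ++k)
                s -= a[i*n + k] * a[j*n + k];
            a[i*n + j] = s / ljj;
        }
    }
    return true;
}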

Stefano M
  • I am worried that "classical Cholesky" (LLT) involves N square roots. Maybe not an issue for big N, but for N=9 it is very suspicious. – qble Nov 14 '12 at 08:49
  • You can go with LDLT without pivoting: you save N square roots at the cost of (N-1)(N-2)/2 multiplies during factorization, so it's about 9 square roots against 28 multiplies. This should be compared to the main cost, which is about (N^3-N)/6 madds (madd = multiply and add), so 120 madds. The impact on the back solution is about N multiplies. If you worry about these numbers, you have no choice but to implement a very aggressive subroutine of your own. For such problems you have to optimize at a very low level, taking into account memory bandwidth and latency for the different cache levels, CPU affinity, ... – Stefano M Nov 14 '12 at 14:13
2

Like others above, I recommend Cholesky. I've found that the increased number of additions, subtractions and memory accesses means that LDLT is slower than Cholesky.

There are in fact a number of variations on Cholesky, and which one is fastest depends on the representation you choose for your matrices. I generally use a Fortran-style representation, that is, a matrix M is a double* m with M(i,j) being m[i + dim*j]; for this I reckon that an upper-triangular Cholesky is (a little) the fastest, that is, one seeks upper-triangular U with U'*U = M.

For a fixed, small dimension it is definitely worth considering writing a version that uses no loops. A relatively straightforward way to do this is to write a program that writes the routine for you. As I recall, using a routine that deals with the general case as a template, it only took a morning to write a program that would write a specific fixed-dimension version. The savings can be considerable. For example, my general version takes 0.47 seconds to do a million 9x9 factorisations, while the loopless version takes 0.17 seconds -- these timings running single-threaded on a 2.6GHz PC.

To show that this is not a major task, I've included the source of such a program below. It includes the general version of the factorisation as a comment. I've used this code in circumstances where the matrices are not close to singular, and I reckon it works OK there; however, it may well be too crude for more delicate work.

/*  ----------------------------------------------------------------
**  to write fixed dimension ut cholesky routines
**  ----------------------------------------------------------------
*/
#include    <stdio.h>
#include    <stdlib.h>
#include    <math.h>
#include    <string.h>
/*  ----------------------------------------------------------------
*/
#if 0
static  inline  double  vec_dot_1_1( int dim, const double* x, const double* y)
{   
double  d = 0.0;
    while( --dim >= 0)
    {   d += *x++ * *y++;
    }
    return d;
}

   /*   ----------------------------------------------------------------
    **  ut cholesky: solve U'*U = P for ut U in P (only ut of P accessed)
    **  ----------------------------------------------------------------
    */   

int mat_ut_cholesky( int dim, double* P)
{
int i, j;
double  d;
double* Ucoli;

    for( Ucoli=P, i=0; i<dim; ++i, Ucoli+=dim)
    {   /* U[i,i] = P[i,i] - Sum{ k<i | U[k,i]*U[k,i]} */
        d = Ucoli[i] - vec_dot_1_1( i, Ucoli, Ucoli);
        if ( d <= 0.0)  /* fail also on a zero pivot to avoid dividing by zero below */
        {   return 0;
        }
        Ucoli[i] = sqrt( d);
        d = 1.0/Ucoli[i];
        for( j=i+1; j<dim; ++j)
        {   /* U[i,j] = (P[i,j] - Sum{ k<i | U[k,i]*U[k,j]})/U[i,i] */
            P[i+j*dim] = d*(P[i+j*dim] - vec_dot_1_1( i, Ucoli, P+j*dim));
        }
    }
    return 1;
}
/*  ----------------------------------------------------------------
*/
#endif

/*  ----------------------------------------------------------------
**  code generator: emit an unrolled fixed-dimension ut cholesky routine
**  ----------------------------------------------------------------
*/
static  void    write_ut_inner_step( int dim, int i, int off)
{
int j, k, l;
    printf( "\td = 1.0/P[%d];\n", i+off);
    for( j=i+1; j<dim; ++j)
    {   k = i+j*dim;
        printf( "\tP[%d] = d * ", k);
        if ( i)
        {   printf( "(P[%d]", k);
            printf( " - (P[%d]*P[%d]", off, j*dim);
            for( l=1; l<i; ++l)
            {   printf( " + P[%d]*P[%d]", l+off, l+j*dim);
            }
            printf( "));");
        }
        else
        {   printf( "P[%d];", k);
        }
        printf( "\n");
    }
}

static  void    write_dot( int n, int off)
{   
int i;
    printf( "P[%d]*P[%d]", off, off);
    for( i=1; i<n; ++i)
    {   printf( "+P[%d]*P[%d]", off+i, off+i);
    }
}

static  void    write_ut_outer_step( int dim, int i, int off)
{
    printf( "\td = P[%d]", off+i);
    if ( i)
    {   printf( " - (");
        write_dot( i, off);
        printf( ")");
    }
    printf( ";\n");

    printf( "\tif ( d <= 0.0)\n");
    printf( "\t{\treturn 0;\n");
    printf( "\t}\n");

    printf( "\tP[%d] = sqrt( d);\n", i+off);
    if ( i < dim-1)
    {   write_ut_inner_step( dim, i, off);  
    }
}

static  void    write_ut_chol( int dim)
{
int i;
int off=0;
    printf( "int\tut_chol_%.2d( double* P)\n", dim);
    printf( "{\n");
    printf( "double\td;\n");
    for( i=0; i<dim; ++i)
    {   write_ut_outer_step( dim, i, off);
        printf( "\n");
        off += dim;
    }
    printf( "\treturn 1;\n");
    printf( "}\n");
}
/*  ----------------------------------------------------------------
*/


/*  ----------------------------------------------------------------
**  command-line parsing and driver
**  ----------------------------------------------------------------
*/
static  int read_args( int* dim, int argc, char** argv)
{
    while( argc)
    {   if ( strcmp( *argv, "-h") == 0)
        {   return 0;
        }
        else if (strcmp( *argv, "-d") == 0)
        {   --argc; ++argv;
            *dim = atoi( (--argc, *argv++));
        }
        else
        {   break;
        }
    }
    return 1;
}

int main( int argc, char** argv)
{
int dim = 9;
    if( read_args( &dim, --argc, ++argv))
    {   write_ut_chol( dim);
    }
    else
    {   fprintf( stderr, "usage: wchol (-d dim)? -- writes to stdout\n");
    }
    return EXIT_SUCCESS;
}
/*  ----------------------------------------------------------------
*/
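
For completeness, a sketch of how the generated factorisation might be used to solve A*x = b; this triangular-solve routine is my own addition, assuming the generated int ut_chol_09( double* P) has succeeded, leaving U in the upper triangle of P (column-major) with U'*U = A:

/* solve A*x = b given the factor U from ut_chol_09: U'*y = b, then U*x = y */
static void ut_solve_09( const double* P, const double* b, double* x)
{
int i, k;
double y[9];
    /* forward substitution with U' (lower triangular) */
    for( i=0; i<9; ++i)
    {   double s = b[i];
        for( k=0; k<i; ++k)
        {   s -= P[k+i*9]*y[k];
        }
        y[i] = s/P[i+i*9];
    }
    /* back substitution with U (upper triangular) */
    for( i=8; i>=0; --i)
    {   double s = y[i];
        for( k=i+1; k<9; ++k)
        {   s -= P[i+k*9]*x[k];
        }
        x[i] = s/P[i+i*9];
    }
}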
dmuir
  • +1 for the contributed code. Some level of loop unrolling can be done by the compiler, at the cost, sometimes, of poor register allocation. The inner Chol loop contains about 120 multiply-adds, so complete unrolling is feasible, as your code demonstrates, but I am not so sure that it will be better than a carefully coded loop version... – Stefano M Nov 14 '12 at 14:35
  • I've not been impressed with gcc's loop unrolling. If I take the variable-dimension code, remove the dim argument and replace it with 9, the resulting function executes in around 0.43 microseconds, as against 0.47 for the variable-dimension one, whereas the "handcoded" version was 0.17. – dmuir Nov 14 '12 at 17:05
  • Perhaps you can provide a sample linear system to solve and how to input it as the matrix to this code. – David Doria Apr 04 '13 at 18:44
0

Because of its ease of use, you can take Eigen's solvers just for comparison. For a specific use case one solver might be faster, even though another is supposed to be better in general. For that, you can measure the runtimes of each algorithm and make the selection on that basis. After that you can implement the desired option yourself (or find an existing one that fits your needs best).
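
A rough sketch of such a measurement with Eigen (the trial count and the random SPD test set are arbitrary choices of mine; only the relative timings mean anything):

#include <Eigen/Dense>
#include <chrono>
#include <iostream>
#include <vector>

typedef Eigen::Matrix<double, 9, 9> Mat9;
typedef Eigen::Matrix<double, 9, 1> Vec9;

int main()
{
    const int trials = 100000;
    std::vector<Mat9, Eigen::aligned_allocator<Mat9> > mats(trials);
    std::vector<Vec9, Eigen::aligned_allocator<Vec9> > rhs(trials);
    for (int i = 0; i < trials; ++i) {
        Mat9 m = Mat9::Random();
        mats[i] = m.transpose() * m;   // SPD test matrices
        rhs[i] = Vec9::Random();
    }

    double sink = 0.0;                 // keeps the solves from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < trials; ++i)
        sink += mats[i].llt().solve(rhs[i]).sum();
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < trials; ++i)
        sink += mats[i].ldlt().solve(rhs[i]).sum();
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "LLT:  " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "LDLT: " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
              << "(checksum " << sink << ")\n";
    return 0;
}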

rkellerm