Optimization Loop unrolling to find the inverse of a matrix by the gauss jordan method

Question

I am trying to apply the loop unrolling to find the inverse of a matrix by the Gauss Jorda method, to reduce the number of memory accesses (bottleneck) when the size of the matrices is very large and does not fit in the caches. I get it to go faster, but the result I get is wrong and I don't know why.

for(k=0;k<size;k++)                                  
{                                                       
    
    pivot=original[k][k];
    for(j=0;j<size;j++)                             
    {
        original[k][j]/=pivot;                                  
        inverse[k][j]/=pivot;                                   
    }   
            
    for(i=0;i<size;i++)                                 
    {
           
        if(i!=k)
        {
            pivot = original[i][k];                                 
            for(j=0;j<size;j++)                         
            {                                                                                   
                original[i][j] -= original[k][j]*pivot;                     
                inverse[i][j] -= inverse[k][j]*pivot;   
                                            
            }
        }
            
    }     
}

I hope that the execution of the problem is faster by receiving the number of memory accesses.

my loop unrrolling version is the following:

for(k=0;k<sizeOfMatrix;k += 2)                                   
{                                                           
    pivot=original[k][k];
    for(j=0;j<sizeOfMatrix;j++)                             
    {
        original[k][j]/=pivot;                                  
        inverse[k][j]/=pivot;                                   
    }   
           
    for(i=0;i<sizeOfMatrix;i++)                                 
    {
       
        if(i!=k && i!=k+1)
        {
            pivot = original[i][k];                                 
            for(j=0;j<sizeOfMatrix;j++)                         
            {                                                                                   
                original[i][j] -= original[k][j]*pivot;                     
                inverse[i][j] -= inverse[k][j]*pivot;                              
            }
        }
               
        if(i!=k+1)
        {
            pivot=original[k][k];
            for(j=0;j<sizeOfMatrix;j++)                             
            {
                original[k+1][j]/=pivot;                                    
                inverse[k+1][j]/=pivot;                                 
            }
                              
            pivot = original[i][k+1];                                   
            for(j=0;j<sizeOfMatrix;j++)                         
            {                                                                                   
                original[i][j] -= original[k+1][j]*pivot;                       
                inverse[i][j] -= inverse[k+1][j]*pivot;                              
            }
        }        

    }     
        
}

Commonly used C compilers already know all about optimization in general and loop unrolling in particular. Rarely is it appropriate to perform loop unrolling manually. Just compile with optimization enabled, and let the compiler determine whether and how to do any loop unrolling. — John Bollinger, Jan 25 '23 at 18:37
In any case, I don't see any evidence of manual loop unrolling in the code posted, so I'm not sure how you expect us to tell you what may have been wrong with your (unshown) unrolling attempt. — John Bollinger, Jan 25 '23 at 18:39
You said that the number of memory access is the bottleneck but I do not believe this is true assuming you enabled *compiler optimization flags* (eg. `-O3`) which is critical here. Mainstream compilers will use *SIMD instructions* in the last loop and your algorithm should be memory bound, more specifically bound by the *RAM throughput*, not the number of memory accesses made by the processor. The L1 cache will clearly be fast enough for that. Note that some compiler can also unroll the loop. Clang and ICC typically do that while GCC typically does not. Did you *profile* your code? — Jérôme Richard, Jan 25 '23 at 18:43
I do note, however, that the code shown will perform division by zero when `original[k][k]` is exactly zero. More generally, it will have numeric problems when the magnitude of `original[k][k]` is very small relative to other elements in the same column. — John Bollinger, Jan 25 '23 at 18:43
I am compiling it with gcc with the -Ofast option. I have profiled the code and it does not perform unrolling. The performance problem arises when I use, for example, values of type float, with an array size of 5000 x 5000. — subject25, Jan 25 '23 at 18:53
I have no idea how you think you can tell from profiling results whether the compiler performed any unrolling, but if it didn't, that would be because it doesn't think that unrolling would make the program faster. It could be wrong about that, but it's a much better judge of that sort of thing than you are (or than I am). — John Bollinger, Jan 25 '23 at 18:56
Yeah GCC tends not to unroll loops but it should be vectorized with SSE2. Consider using `-mavx` if the target processor support it, or just use `-march=native` if you do not care about portability over other processor architectures. Unrolling of the last loop should not help, but a 2D tiling can certainly help. It will make the code more complex but certainly faster due to a lower memory pressure (less data to transfer from/to the RAM). The thing is LAPACK libraries should do that very well. Please consider not reinventing the wheel unless the purpose of doing this is to learn how it works. — Jérôme Richard, Jan 25 '23 at 18:58

John Bollinger · Answer 1 · 2023-01-26T18:03:41.070

For the moment, let's leave aside the questions of the numerical fitness of your original code and of whether the unrolling is an exercise you should be attempting at all. You show an attempt to unroll the outer loop by rewriting the body of the middle loop. Although a computation equivalent to the original could be structured along the lines similar to those of the transformed result, it would not constitute an unrolling because loop unrolling does not change order of the operations performed in the loop body over the range of loop iterations. The transformation you have made does produce such changes.

But in your case, the transformed code is simply wrong, at least because

when sizeOfMatrix is odd, it performs out-of-bounds array accesses during the k-loop iteration where k == sizeOfMatrix - 1.
it uses the wrong pivot value at the beginning of the if(i!=k+1) block.
the correct value would, in principle, be stored at original[k+1][k+1], but the needed value has not yet been computed on iterations where i <= k.
the if(i!=k+1) block scales row k+1 of the original and inverse matrices for every i other than k+1, but those should be scaled only once each.

Without further consideration of whether a transformation along the lines you've attempted might lend itself to a performance improvement, I observe that if you wanted to perform a genuine unrolling then that's the wrong loop to be trying to unroll. If there were any advantage to be gained by a true unrolling then it would come from unrolling the innermost loop. Like this, for example:

for (k = 0; k < size; k++) {                                                       
    // ...
    for (i = 0; i < size; i++) {
        // ...
            for (j = 0; j < size ^ 1; j += 2) {
                // two rows per iteration                                                                                  
                original[i][j] -= original[k][j] * pivot;
                original[i][j + 1] -= original[k][j + 1] * pivot;
                inverse[i][j] -= inverse[k][j] * pivot;
                inverse[i][j + 1] -= inverse[k][j + 1] * pivot;
            }
            if (size & 1) {
                // a single additional row when needed
                assert(j == size - 1);
                original[i][j] -= original[k][j] * pivot;
                inverse[i][j] -= inverse[k][j] * pivot;
            }
        // ...
    }
}

But as I remarked in comments, if your compiler expected such an unrolling to speed up the program then it would do it automatically when you compile with optimization enabled. And the compiler's judgement about such things is pretty good.

In fact, hand-optimizations can easily worsen performance instead of improving it, either inherently or because they interfere with the compiler performing more effective optimizations itself.

If best speed is your objective then you should probably use a well-tuned third-party implementation, such as from lapack or atlas.

If you insist on rolling your own, then you likely could get more speedup simply by improving the algorithm (even sticking with G-J), but you should probably give some of that back to obtain improved numeric stability. Specifically,

at each outer-loop iteration, choose the available row with the largest-magnitude value in the current elimination column, and swap it, if necessary, with the current row. This will minimize the incidence of dividing large numbers by very small numbers, especially zero.
(after (1), above,) in the original matrix, compute updates only for the elements in columns to the right of the current pivot. The other elements of the original have already had all the influence on the result that they ever will have. (But you cannot shortcut this way with the inverse matrix.)

If you do not perform some variation on (1) then your inversion will fail on some matrices that are, in fact, invertible.

Something along these lines, for example:

// Temporary space for row swaps
element_type *tmp = malloc(size * sizeof(*tmp));
if (!tmp) {
    abort();  // you probably want something more graceful here
}

for (int pivot_col = 0; pivot_col < size; pivot_col++) {

    // Choose the remaining row that maximizes the magnitude of the pivot element

    int pivot_row = pivot_col;
    element_type pivot = fabs(original[pivot_row][pivot_col]);  // or fabsf() or fabsl() ...
                                                     
    for (row = pivot_row + 1; row < size; row++) {
        element_type test_el = fabs(original[row][pivot_col]);

        if (test_el > pivot) {
            pivot_row = row;
            pivot = test_el;
        }
    }

    if (pivot_row != pivot_col) {

        // This assumes array-of-array structure.
        // There are better options for array-of-pointers structure.

        int current_row = pivot_col;

        // Only the tails of the original rows, starting at pivot_col, need to be swapped
        memcpy(tmp, &original[current_row][pivot_col], (size - pivot_col) * sizeof(*tmp));
        memcpy(&original[current_row][pivot_col], &original[pivot_row][pivot_col], (size - pivot_col) * sizeof(*tmp));
        memcpy(&original[pivot_row][pivot_col], tmp, (size - pivot_col) * sizeof(*tmp));

        // The full inverse rows need to be swapped
        memcpy(tmp, inverse[current_row], sizeof(inverse[current_row]));
        memcpy(inverse[current_row], inverse[pivot_row], sizeof(inverse[current_row]));
        memcpy(inverse[pivot_row], tmp, sizeof(inverse[current_row]));
    }
    pivot_row = pivot_col;
    pivot = original[pivot_row][pivot_col];  // get the correctly-signed pivot value

    // All columns of the inverse row need to be scaled, but
    // only the tail columns of the original row need to be.
    int col = 0;

    for (; col < pivot_col + 1; col++) {
        inverse[pivot_row][col] /= pivot;                                   
    }   
    for (; col < size; col++) {
        original[pivot_row][col] /= pivot;                                  
        inverse[pivot_row][col] /= pivot;                                   
    }   
            
    // Compute the effect on the other rows of canceling the pivot column
    for (int row = 0; row < size; row++) {
        if (row == pivot_row) continue;

        element_type cancel_val = original[row][pivot_col];

        // where col < pivot_col, original[pivot_row][col] is (logically) zero
        // where col == pivot_col, we are (logically) setting original[row][col] to zero
        for (col = 0; col <= pivot_col; col++) {
            inverse[row][col] -= inverse[pivot_row][col] * cancel_val;   
        }
        // we need actually to write updates to the remaining columns of the original
        for(; col < size; col++) {
            original[row][col] -= original[pivot_row][col] * cancel_val;
            inverse[row][col] -= inverse[pivot_row][col] * cancel_val;
        }
    }     
}

free(tmp);

That's inspired by your first code, with improved names and an implementation of algorithmic improvements (1) and (2).

score 0 · Accepted Answer · answered Jan 28 '23 at 19:40

You already found the solution to the problem. As the objective is academic. The goal is to reduce the execution time, reduce the number of memory accesses, in my case reduce it by half.

I modify the row k, I divide the elements of the row by its pivot
Modify the row k+1 of the matrices, with the new row k
Modified row k+1, divides the elements of the row by its pivot.
I make the modifications of the other rows with the values k and k + 1

    for (k = 0; k < size; k += 2)
    {
        pivot = original[k][k];
        for (j = 0; j < size; j++)
        {
            original[k][j] /= pivot;
            inverse[k][j] /= pivot;

        }
        pivot = original[k + 1][k];
        for (i = 0; i < size; i++)
        {
            original[k + 1][i] -= original[k][i] * pivot;
            inverse[k + 1][i] -= inverse[k][i] * pivot;
        }
        pivot = original[k+1][k+1];

        for (j = 0; j < size; j++)
        {
            original[k+1][j] /= pivot;
            inverse[k+1][j] /= pivot;
        }
        for (i = 0; i < size; i++)
        {
            if (i != k && i != k + 1)
            {
                pivot = original[i][k];
                
                    for (j = 0; j < size; j++)
                    {
                        original[i][j] -= original[k][j] * pivot;
                        inverse[i][j] -= inverse[k][j] * pivot;
                    }
            }

            if (i != k + 1)
            {
                pivot = original[i][k+1];
                for (j = 0; j < size; j++)
                {
                    original[i][j] -= original[k + 1][j] * pivot;
                    inverse[i][j] -= inverse[k + 1][j] * pivot;

                }
            }
        }
    }

Optimization Loop unrolling to find the inverse of a matrix by the gauss jordan method

2 Answers2