
I have a hand-rolled matrix algorithm which finds the largest absolute value in the lower-right square block of a square matrix (thus, when iterating, some parts are 'jumped' over) - stored as a dense matrix. After an update from VS2010 to VS2017 it seems to be much slower - about a 50% slowdown overall. After some investigation, I located this to the inner loop of a function finding the absolute largest value. Looking at the output, this seems to be due to some extra instructions being inserted within the tight loop. Reworking the loop in different ways seems to solve or partly solve the issue. GCC doesn't seem to have this "issue" in comparison.
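To make the access pattern concrete, here is a minimal sketch of what the full routine is meant to do - my reconstruction from the description above, assuming row-major storage and that `from` marks the first row/column of the lower-right block (the real code may differ):

#include <cmath>
#include <cstddef>

// Reconstruction (an assumption, not the original algorithm): find the element
// with the largest absolute value in the lower-right (w-from) x (w-from)
// block of a row-major w*w dense matrix; rows/columns before `from` are
// the parts that get 'jumped' over.
void absmax_submatrix(const double* A, std::size_t from, std::size_t w,
                      std::size_t& ir, std::size_t& ic)
{
    double biga_absval = std::fabs(A[from * w + from]);
    ir = from; ic = from;
    for (std::size_t r = from; r < w; ++r) {
        for (std::size_t c = from; c < w; ++c) {
            double v = std::fabs(A[r * w + c]);
            if (v > biga_absval) { biga_absval = v; ir = r; ic = c; }
        }
    }
}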

Simplified examples (fabs is not always necessary to reproduce):

#include <cmath>
#include <cstddef>  // size_t
#include <iostream>

// The variant that got slower after the update: VS2017 inserts extra
// instructions into this inner loop.
int f_slow(double *A, size_t from, size_t w)
{
    double biga_absval = *A;
    size_t ir = 0, ic = 0;
    for ( size_t j = 0; j < w; j++ ) {
      size_t n = j*w;
      for ( ; n < j*w+w; n++ ) {
        if ( fabs(A[n]) <= biga_absval ) {
          biga_absval = fabs( A[n] );  // fabs evaluated a second time
          ir   = j;
          ic   = n;
        }
        n++;  // together with the loop's n++, this visits every second element
      }
    }

    std::cout << ir << ic;
    return 0;
}

// Reworked with pointers; this variant does not show the slowdown.
int f_fast(double *A, size_t from, size_t w)
{
    double* biga = A;
    double biga_absval = *biga;

    double* n_begin = A + from;
    double* n_end = A + w;
    for (double* A_n = n_begin; A_n < n_end; ++A_n) {  // note: stride 1, unlike the other two
      if (fabs(*A_n) > biga_absval) {
        biga_absval = fabs(*A_n);
        biga = A_n;
      }
    }

    std::cout << biga;
    return 0;
}

// Indexed variant that also avoids (or partly avoids) the slowdown.
int f_faster(double *A, size_t from, size_t w)
{
    double biga_absval = *A;
    size_t ir = 0, ic = 0;
    for ( size_t j = 0; j < w; j++ ) {
      size_t n = j;
      for ( ; n < j*w+w; n++ ) {
        if ( fabs(A[n]) > biga_absval ) {
          biga_absval = fabs( A[n] );
          ir   = j;
          ic   = n - j*w;
        }
        n++;  // again every second element (see Ped7g's comment below)
      }
    }

    std::cout << ir << ic;
    return 0;
}

Please note: the examples were created only for looking at the compiler output (indexes etc. don't necessarily make sense):

https://godbolt.org/z/q9rWwi

So my question is: is this just a (known?) optimizer bug, or is there some logic behind what in this case seems like a clear optimization miss?

Using the latest stable VS2017, 15.9.5.

Update: The extra `mov`s I see are right before the jump instructions - the easiest way to find them in Compiler Explorer is to right-click on the `if` and then "scroll to".

  • `f_fast` is checking every element of A (1x `++A_n`), `f_faster` only every second (2x `n++` in inner loop)... is it intentional? – Ped7g Jan 29 '19 at 14:18
  • The code is VERY different between /O2 and /O3. – Matthieu Brucher Jan 29 '19 at 14:21
  • This question could benefit from some cleanup. There are three functions, but it's unclear which of the three is affected by the VS2017 slowdown. These functions don't do the same thing at all; 2 of the 3 even ignore the `from` parameter. – MSalters Jan 29 '19 at 15:07
  • @MSalters It is the slow one that is being compared with VS2010 – darune Jan 29 '19 at 18:46
  • @MSalters while you're right that they don't conform - it is mainly some different indexing - that shouldn't really change anything about how the inner loop is optimized (?) – darune Jan 29 '19 at 18:51
  • @MatthieuBrucher: MSVC doesn't have a `/O3` option. It ignores it and you get the default debug-mode un-optimized code. `cl : Command line warning D9002 : ignoring unknown option '/O3'`. This is unlike gcc/clang, where `-O3` enables full optimization, including auto-vectorization with gcc. (clang enables auto-vec at -O2, but gcc only at -O3). – Peter Cordes Jan 29 '19 at 22:00
  • I dealt with a similar problem in https://stackoverflow.com/questions/32511862/how-does-visual-studio-2013-detect-buffer-overrun due to changes in the default security settings of the C++ compiler in Visual Studio ~2013. Buffer overrun handling changed, but I guess that is not quite the same issue you are dealing with. – BlueMonkMN Feb 07 '19 at 14:38
  • Could be Spectre mitigation slowing things down – 0x777C May 16 '19 at 15:03
  • @faissaloo it doesn't seem to be the case – darune May 20 '19 at 11:00

1 Answer


Well, I don't know why VC gets worse in your case, but I would like to offer some hints on how to save some ops.

#include <cmath>    // std::fabs
#include <cstddef>  // std::size_t
#include <iostream>

void f_faster( const double* A, const std::size_t w ) {
    double      biga_absval = A[ 0 ];
    std::size_t ir = 0, ic_n = 0;  // need a default in case nothing beats A[0]
    for ( std::size_t j = 0; j < w; ++j ) {
        const auto N = j * w + w;  // hoist the inner loop's upper bound
        for ( std::size_t n = j; n < N; n += 2 ) {
            // C++17 if-with-initializer: evaluate std::fabs only once
            if ( const auto new_big_a = std::fabs( A[ n ] ); new_big_a > biga_absval ) {
                biga_absval = new_big_a;
                ir          = j;
                ic_n        = n;
            }
        }
    }

    std::cout << ir << ( ic_n - ir * w );
}
  • don't calculate ic in the inner loop, just store n for later use
  • use const to help the optimizer
  • don't evaluate std::fabs twice
  • post-increment creates a copy that you don't need (probably optimized away)
  • store the loop's upper bound outside, otherwise it might be re-evaluated (probably optimized away)
  • just increment n by two, instead of two increments by one
  • don't initialize with unused values (though `ir` and `ic_n` above do need a default, in case no element beats `A[ 0 ]`)

Maybe that's already enough to get rid of the extra mov?
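For what it's worth, here is a minimal timing harness for the function above (my addition; the matrix size and the use of `std::chrono` are arbitrary, and the quick-bench link in the comments below is the more rigorous way to compare):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t w = 2000;  // arbitrary size; pick one matching your workload
    std::vector<double> A(w * w);
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (auto& x : A) x = dist(gen);

    const auto t0 = std::chrono::steady_clock::now();
    f_faster(A.data(), w);  // the function defined above
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << "\nelapsed: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
}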

malik
  • This can be auto-vectorized for x86 by incrementing and blending a vector of integer `n` values based on a vector AND (abs) / compare. Only profitable for `double` with AVX to do 4 doubles per vector, only 2 of which are useful since this has a stride of 2. (Ignore the odd elements when checking horizontally at the end of the loop.) Maybe not worth it overall, although with `float` it would be good with AVX. – Peter Cordes May 11 '20 at 23:44 (a rough intrinsics sketch of this idea follows after these comments)
  • care to throw up a link in compiler explorer for comparison? – darune May 25 '20 at 13:12
  • I am not convinced that your suggestions help - i.e. end up in faster assembly code - but feel free to modify my example on compiler explorer to prove me wrong. – darune May 25 '20 at 13:28
  • You are right, the difference is minimal; you can experiment here: http://quick-bench.com/Y7LD3P2QRwmrp8U0G6a_vyoQouY It also depends on the compiler. My guess is that something in the code makes your compiler cough up, and changing it slightly maybe works around your compiler issue. You can only know by compiling and benchmarking on your setup. Comparing the assembly is unreliable, because it is next to impossible to predict what your CPU is actually doing in detail. – malik May 25 '20 at 18:02
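Following up on Peter Cordes' vectorization comment above, here is a rough AVX2 intrinsics sketch of the compare-and-blend index tracking he describes. This is my illustration, not code from the thread: it is simplified to stride 1 and assumes the length is a multiple of 4 (compile with `/arch:AVX2` or `-mavx2`):

#include <immintrin.h>
#include <cstddef>

// Track a vector of running max |values| and a matching vector of element
// indices; each compare produces a lane mask that blends both at once.
std::size_t absmax_index_avx2(const double* a, std::size_t n)
{
    const __m256d signmask = _mm256_set1_pd(-0.0);   // only the sign bit set
    __m256d best    = _mm256_setzero_pd();
    __m256i bestidx = _mm256_setzero_si256();
    __m256i idx     = _mm256_setr_epi64x(0, 1, 2, 3);
    const __m256i step = _mm256_set1_epi64x(4);

    for (std::size_t i = 0; i < n; i += 4) {
        __m256d v  = _mm256_andnot_pd(signmask, _mm256_loadu_pd(a + i)); // fabs: clear sign bits
        __m256d gt = _mm256_cmp_pd(v, best, _CMP_GT_OQ);                 // lanes where v > best
        best    = _mm256_blendv_pd(best, v, gt);
        bestidx = _mm256_blendv_epi8(bestidx, idx, _mm256_castpd_si256(gt));
        idx     = _mm256_add_epi64(idx, step);
    }

    // horizontal step at the end: pick the lane holding the largest |value|
    alignas(32) double    vals[4];
    alignas(32) long long idxs[4];
    _mm256_store_pd(vals, best);
    _mm256_store_si256((__m256i*)idxs, bestidx);
    std::size_t bi = 0;
    for (int k = 1; k < 4; ++k)
        if (vals[k] > vals[bi]) bi = k;
    return (std::size_t)idxs[bi];
}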