
I have a hand-rolled matrix algorithm which finds the largest absolute value in the lower-right square block of a square matrix (thus, when iterating, some parts are 'jumped' over) - stored as a dense matrix. After an update from VS2010 to VS2017 it seems to be much slower - about a 50% slowdown overall. After some investigation, I located this to the inner loop of a function finding the absolute largest value. Looking at the output, this seems to be due to some extra instructions being inserted within the tight loop. Reworking the loop in different ways seems to solve or partly solve the issue. GCC doesn't seem to have this "issue" in comparison.
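To make the access pattern concrete, here is a minimal sketch of what the full routine is meant to do - my reconstruction from the description above, assuming row-major storage and that `from` marks the first row/column of the lower-right block (the real code may differ):

#include <cmath>
#include <cstddef>

// Reconstruction (an assumption, not the original algorithm): find the element
// with the largest absolute value in the lower-right (w-from) x (w-from)
// block of a row-major w*w dense matrix; rows/columns before `from` are
// the parts that get 'jumped' over.
void absmax_submatrix(const double* A, std::size_t from, std::size_t w,
                      std::size_t& ir, std::size_t& ic)
{
    double biga_absval = std::fabs(A[from * w + from]);
    ir = from; ic = from;
    for (std::size_t r = from; r < w; ++r) {
        for (std::size_t c = from; c < w; ++c) {
            double v = std::fabs(A[r * w + c]);
            if (v > biga_absval) { biga_absval = v; ir = r; ic = c; }
        }
    }
}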

Simplified examples (fabs is not always necessary to reproduce):

#include <cmath>
#include <cstddef>  // size_t
#include <iostream>

// The variant that got slower after the update: VS2017 inserts extra
// instructions into this inner loop.
int f_slow(double *A, size_t from, size_t w)
{
    double biga_absval = *A;
    size_t ir = 0, ic = 0;
    for ( size_t j = 0; j < w; j++ ) {
      size_t n = j*w;
      for ( ; n < j*w+w; n++ ) {
        if ( fabs(A[n]) <= biga_absval ) {
          biga_absval = fabs( A[n] );  // fabs evaluated a second time
          ir   = j;
          ic   = n;
        }
        n++;  // together with the loop's n++, this visits every second element
      }
    }

    std::cout << ir << ic;
    return 0;
}

// Reworked with pointers; this variant does not show the slowdown.
int f_fast(double *A, size_t from, size_t w)
{
    double* biga = A;
    double biga_absval = *biga;

    double* n_begin = A + from;
    double* n_end = A + w;
    for (double* A_n = n_begin; A_n < n_end; ++A_n) {  // note: stride 1, unlike the other two
      if (fabs(*A_n) > biga_absval) {
        biga_absval = fabs(*A_n);
        biga = A_n;
      }
    }

    std::cout << biga;
    return 0;
}

// Indexed variant that also avoids (or partly avoids) the slowdown.
int f_faster(double *A, size_t from, size_t w)
{
    double biga_absval = *A;
    size_t ir = 0, ic = 0;
    for ( size_t j = 0; j < w; j++ ) {
      size_t n = j;
      for ( ; n < j*w+w; n++ ) {
        if ( fabs(A[n]) > biga_absval ) {
          biga_absval = fabs( A[n] );
          ir   = j;
          ic   = n - j*w;
        }
        n++;  // again every second element (see Ped7g's comment below)
      }
    }

    std::cout << ir << ic;
    return 0;
}

Please note: the examples were created only for looking at the compiler output (indexes etc. don't necessarily make sense):

https://godbolt.org/z/q9rWwi

So my question is: is this just a (known?) optimizer bug, or is there some logic behind what in this case seems like a clear optimization miss?

Using the latest stable VS2017, 15.9.5.

Update: The extra `mov`s I see are right before the jump instructions - the easiest way to find them in Compiler Explorer is to right-click on the `if` and then "scroll to".

  • `f_fast` is checking every element of A (1x `++A_n`), `f_faster` only every second (2x `n++` in inner loop)... is it intentional? – Ped7g Jan 29 '19 at 14:18
  • The code is VERY different between /O2 and /O3. – Matthieu Brucher Jan 29 '19 at 14:21
  • This question could benefit from some cleanup. There are three functions, but it's unclear which of the three is affected by the VS2017 slowdown. These functions don't do the same thing at all; 2 of the 3 even ignore the `from` parameter. – MSalters Jan 29 '19 at 15:07
  • @MSalters It is the slow one that is being compared with VS2010 – darune Jan 29 '19 at 18:46
  • @MSalters while you're right that they don't conform - it is mainly some different indexing - that shouldn't really change anything about how the inner loop is optimized (?) – darune Jan 29 '19 at 18:51
  • @MatthieuBrucher: MSVC doesn't have a `/O3` option. It ignores it and you get the default debug-mode un-optimized code. `cl : Command line warning D9002 : ignoring unknown option '/O3'`. This is unlike gcc/clang, where `-O3` enables full optimization, including auto-vectorization with gcc. (clang enables auto-vec at -O2, but gcc only at -O3). – Peter Cordes Jan 29 '19 at 22:00
  • I dealt with a similar problem in https://stackoverflow.com/questions/32511862/how-does-visual-studio-2013-detect-buffer-overrun due to changes in the default security settings of the C++ compiler in Visual Studio ~2013. Buffer overrun handling changed, but I guess that is not quite the same issue you are dealing with. – BlueMonkMN Feb 07 '19 at 14:38
  • Could be Spectre mitigation slowing things down – 0x777C May 16 '19 at 15:03
  • @faissaloo it doesn't seem to be the case – darune May 20 '19 at 11:00

1 Answer


Well, I don't know why VC gets worse in your case, but I would like to offer some hints on how to save some ops.

#include <cmath>    // std::fabs
#include <cstddef>  // std::size_t
#include <iostream>

void f_faster( const double* A, const std::size_t w ) {
    double      biga_absval = A[ 0 ];
    std::size_t ir = 0, ic_n = 0;  // need a default in case nothing beats A[0]
    for ( std::size_t j = 0; j < w; ++j ) {
        const auto N = j * w + w;  // hoist the inner loop's upper bound
        for ( std::size_t n = j; n < N; n += 2 ) {
            // C++17 if-with-initializer: evaluate std::fabs only once
            if ( const auto new_big_a = std::fabs( A[ n ] ); new_big_a > biga_absval ) {
                biga_absval = new_big_a;
                ir          = j;
                ic_n        = n;
            }
        }
    }

    std::cout << ir << ( ic_n - ir * w );
}
  • don't calculate ic in the inner loop, just store n for later use
  • use const to help the optimizer
  • don't evaluate std::fabs twice
  • post-increment creates a copy that you don't need (probably optimized away)
  • store the loop's upper bound outside, otherwise it might be re-evaluated (probably optimized away)
  • just increment n by two, instead of two increments by one
  • don't initialize with unused values (though `ir` and `ic_n` above do need a default, in case no element beats `A[ 0 ]`)

Maybe that's already enough to get rid of the extra mov?
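For what it's worth, here is a minimal timing harness for the function above (my addition; the matrix size and the use of `std::chrono` are arbitrary, and the quick-bench link in the comments below is the more rigorous way to compare):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t w = 2000;  // arbitrary size; pick one matching your workload
    std::vector<double> A(w * w);
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (auto& x : A) x = dist(gen);

    const auto t0 = std::chrono::steady_clock::now();
    f_faster(A.data(), w);  // the function defined above
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << "\nelapsed: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
}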

malik
  • This can be auto-vectorized for x86 by incrementing and blending a vector of integer `n` values based on a vector AND (abs) / compare. Only profitable for `double` with AVX to do 4 doubles per vector, only 2 of which are useful since this has a stride of 2. (Ignore the odd elements when checking horizontally at the end of the loop.) Maybe not worth it overall, although with `float` it would be good with AVX. – Peter Cordes May 11 '20 at 23:44 (a rough intrinsics sketch of this idea follows after these comments)
  • care to throw up a link in compiler explorer for comparison? – darune May 25 '20 at 13:12
  • I am not convinced that your suggestions help - i.e. end up in faster assembly code - but feel free to modify my example on compiler explorer to prove me wrong. – darune May 25 '20 at 13:28
  • You are right, the difference is minimal; you can experiment here: http://quick-bench.com/Y7LD3P2QRwmrp8U0G6a_vyoQouY It also depends on the compiler. My guess is that something in the code makes your compiler cough up, and changing it slightly maybe works around your compiler issue. You can only know by compiling and benchmarking on your setup. Comparing the assembly is unreliable, because it is next to impossible to predict what your CPU is actually doing in detail. – malik May 25 '20 at 18:02
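Following up on Peter Cordes' vectorization comment above, here is a rough AVX2 intrinsics sketch of the compare-and-blend index tracking he describes. This is my illustration, not code from the thread: it is simplified to stride 1 and assumes the length is a multiple of 4 (compile with `/arch:AVX2` or `-mavx2`):

#include <immintrin.h>
#include <cstddef>

// Track a vector of running max |values| and a matching vector of element
// indices; each compare produces a lane mask that blends both at once.
std::size_t absmax_index_avx2(const double* a, std::size_t n)
{
    const __m256d signmask = _mm256_set1_pd(-0.0);   // only the sign bit set
    __m256d best    = _mm256_setzero_pd();
    __m256i bestidx = _mm256_setzero_si256();
    __m256i idx     = _mm256_setr_epi64x(0, 1, 2, 3);
    const __m256i step = _mm256_set1_epi64x(4);

    for (std::size_t i = 0; i < n; i += 4) {
        __m256d v  = _mm256_andnot_pd(signmask, _mm256_loadu_pd(a + i)); // fabs: clear sign bits
        __m256d gt = _mm256_cmp_pd(v, best, _CMP_GT_OQ);                 // lanes where v > best
        best    = _mm256_blendv_pd(best, v, gt);
        bestidx = _mm256_blendv_epi8(bestidx, idx, _mm256_castpd_si256(gt));
        idx     = _mm256_add_epi64(idx, step);
    }

    // horizontal step at the end: pick the lane holding the largest |value|
    alignas(32) double    vals[4];
    alignas(32) long long idxs[4];
    _mm256_store_pd(vals, best);
    _mm256_store_si256((__m256i*)idxs, bestidx);
    std::size_t bi = 0;
    for (int k = 1; k < 4; ++k)
        if (vals[k] > vals[bi]) bi = k;
    return (std::size_t)idxs[bi];
}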