I'm writing a program using Intel AVX2 instructions. I found a bug that appears only at optimization level -O2 or higher (with -O1 it works). After extensive debugging, I narrowed down the buggy region. The bug now seems to be caused by the compiler incorrectly optimizing out a simple copy assignment of an __m256i variable.

Consider the following code snippet. Foo is a templated function. I test with CMP = kLess and OPT = kSet. I'm aware that the optimizer will probably optimize out the switches; it may even optimize out the variable y.

The buggy line is y = m_lt;. When compiled with -O2, this line seems to be ignored, so y doesn't get the right value and the program produces a wrong result. With -O1 the program is correct.

To verify my judgement, I replaced y = m_lt; with two alternatives:

y = avx_or(m_lt, avx_zero()); takes the bitwise OR of m_lt and an all-zeros vector.

y = _mm256_load_si256(&m_lt); uses the SIMD load intrinsic to load the data from the address of m_lt.

Both should be semantically equivalent to y = m_lt;. My intention was to prevent some optimization by wrapping the copy in a function call. The program works correctly with either replacement at all optimization levels, so the problem is weird. To my knowledge, direct assignment of SIMD variables is definitely okay (I have used it a lot before). Could this be a problem in the compiler?

typedef __m256i AvxUnit;

template <Comparator CMP, Bitwise OPT>
void Foo(){
    AvxUnit m_lt;
    //...

    assert(!avx_iszero(m_lt));   //always pass

    AvxUnit y;

    switch(CMP){
        case Comparator::kEqual:
            y = m_eq;
            break;
        case Comparator::kInequal:
            y = avx_not(m_eq);
            break;
        case Comparator::kLess:
            y = m_lt;   //**********Bug?*************
            //y = avx_or(m_lt, avx_zero());   //Replace with this line is good.
            //y = _mm256_load_si256(&m_lt);   //Replace with this line is good too.
            break;
        case Comparator::kGreater:
            y = m_gt;
            break;
        case Comparator::kLessEqual:
            y = avx_or(m_lt, m_eq);
            break;
        case Comparator::kGreaterEqual:
            y = avx_or(m_gt, m_eq);
            break;
    }

    switch(OPT){
        case Bitwise::kSet:
            break;
        case Bitwise::kAnd:
            y = avx_and(y, bvblock->GetAvxUnit(bv_word_id));
            break;
        case Bitwise::kOr:
            y = avx_or(y, bvblock->GetAvxUnit(bv_word_id));
            break;
    }

    assert(!avx_iszero(y));   //pass with -O1, fail with -O2 or higher

    bvblock->SetAvxUnit(y, bv_word_id);
    //...
}
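
The three forms of the copy discussed above can be compared in isolation, which may also serve as a starting point for a reduced test case. The following is only a sketch: the helper name CopyFormsAgree is hypothetical, and it assumes that the question's avx_or and avx_zero wrappers (whose definitions are not shown) map to _mm256_or_si256 and _mm256_setzero_si256. With a constant input the compiler will most likely fold everything away, so a real reduction would have to feed in the values computed by the elided part of Foo.

```cpp
#include <cassert>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: returns true when the three ways of copying an
// __m256i produce bit-identical, nonzero results.
// When built without AVX2 support (no -mavx2), nothing is checked and
// the function simply returns true.
bool CopyFormsAgree() {
#if defined(__AVX2__)
    // Arbitrary nonzero pattern standing in for the question's m_lt.
    const __m256i m_lt = _mm256_set1_epi32(0x0F0F0F0F);

    const __m256i y_assign = m_lt;                                           // plain copy
    const __m256i y_or     = _mm256_or_si256(m_lt, _mm256_setzero_si256());  // OR with zero
    const __m256i y_load   = _mm256_load_si256(&m_lt);                       // aligned load

    // XOR of two equal vectors is all zero, and _mm256_testz_si256(a, a)
    // returns 1 exactly when a is all zero.
    const __m256i d1 = _mm256_xor_si256(y_assign, y_or);
    const __m256i d2 = _mm256_xor_si256(y_assign, y_load);
    const bool same    = _mm256_testz_si256(d1, d1) && _mm256_testz_si256(d2, d2);
    const bool nonzero = !_mm256_testz_si256(y_assign, y_assign);
    return same && nonzero;
#else
    return true;
#endif
}
```

Calling assert(CopyFormsAgree()); from main and building once with -O1 -mavx2 and once with -O2 -mavx2 gives a quick sanity check. If this passes at both levels while the full program still fails, the cause is more likely in the surrounding code than in the copy itself.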
Mysticial
Neo1989
  • Maybe a side note, but does `y = avx_or(m_lt, avx_ones());` really allow things to run correctly? It should give a value of all ones...? – Joachim Isaksson Oct 09 '14 at 09:21
  • @JoachimIsaksson oh sorry that's a mistake. I've corrected. – Neo1989 Oct 09 '14 at 09:24
  • Wouldn't the intrinsics be faster than direct assignment anyway? Not saying you didn't hit a bug, but it may be faster to actually work around it. – Joachim Isaksson Oct 09 '14 at 09:39
  • If you believe there is a compiler bug, first produce a SSCCE, and if doing the reduction didn't point to a problem in your code, post it to gcc's bugzilla. That's almost the only way of moving things forward. – Marc Glisse Oct 13 '14 at 13:27
  • Which version is this? – Surt Oct 13 '14 at 17:35
  • Please post a [MCVE](http://stackoverflow.com/help/mcve) . Possibly code elsewhere in your program is causing undefined behaviour, so this is the only way to be sure. – M.M Jul 28 '15 at 23:22
  • Also include compiler version and platform - g++ for Windows has some issues with intrinsic types wider than 64bit – M.M Jul 28 '15 at 23:22

1 Answer

The reason the compiler would optimize away the assignment is probably that it believes that line to be dead code. So your CMP is probably not Comparator::kLess.

The assignments you tried as workarounds could be implemented using __asm__ volatile statements, which cannot be optimized away.

Declaring m_lt as volatile probably won't greatly impact your performance, but it's a dirty hack. I would look more closely at CMP and check whether it can actually take the kLess value.
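
One way to check which comparator an instantiation was actually compiled for is to have the template report it. This is a minimal sketch, not the asker's code: the enum below is a stand-in that reuses the enumerator names from the question, and ComparatorName is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

// Stand-in for the question's Comparator enum (same enumerator names,
// actual definition not shown in the question).
enum class Comparator {
    kEqual, kInequal, kLess, kGreater, kLessEqual, kGreaterEqual
};

// Hypothetical helper: returns the name of the comparator that this
// instantiation was compiled for. CMP is a compile-time constant, so the
// compiler keeps only the matching case, as it would in Foo.
template <Comparator CMP>
std::string ComparatorName() {
    switch (CMP) {
        case Comparator::kEqual:        return "kEqual";
        case Comparator::kInequal:      return "kInequal";
        case Comparator::kLess:         return "kLess";
        case Comparator::kGreater:      return "kGreater";
        case Comparator::kLessEqual:    return "kLessEqual";
        case Comparator::kGreaterEqual: return "kGreaterEqual";
    }
    return "unknown";  // reached only for a value outside the enum
}
```

Logging ComparatorName<CMP>() at the top of Foo, or adding static_assert(CMP == Comparator::kLess, "unexpected CMP") to a test-only instantiation, would show whether the kLess branch is really the one being compiled.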

VAndrei
  • Thanks for the reminder. But the point is I DO want the compiler to optimize it, but of course, without losing correctness. – Neo1989 Oct 09 '14 at 09:11
  • You could try volatile on the "m_lt" variable. The y variable will still be optimized. But please also try the case where y is volatile as well. I don't expect the code to gain much from this optimization. The big perf gain is because you use SIMD. – VAndrei Oct 09 '14 at 09:14