20

GCC is giving me a hard time generating optimal assembly for following source code:

memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
    X[0] ^= table[i][Y[i]].asQWord;
}

X being an uint64_t[2] array, and
Y being an unsigned char[16] array, and
table being a double dimensional array of union qword_t:

union qword_t {
    uint8_t asBytes[8];
    uint64_t asQWord;
};

const union qword_t table[16][256] = /* ... */;

With options -m64 -Ofast -mno-sse it does unroll the loop, and each xor with assignment results in 3 instructions (thus overall number of instructions issued is 3 * 16 = 48):

movzx  r9d, byte ptr [Y + i]                   ; extracting byte
xor    rax, qword ptr [table + r9*8 + SHIFT]   ; xoring, SHIFT = i * 0x800
mov    qword ptr [X], rax                      ; storing result

Now, my understanding is that resulting X value could be accumulated in rax register throughout all 16 xors, and then it could be stored at [X] address, which could be achieved with these two instructions for each xor with assignment:

movzx  r9d, byte ptr [Y + i]                   ; extracting byte
xor    rax, qword ptr [table + r9*8 + SHIFT]   ; xoring, SHIFT = i * 0x800

and single storing:

mov    qword ptr [X], rax                      ; storing result

(In this case overall number of instructions is 2 * 16 + 1 = 33)

Why does GCC generate these redundant mov instructions? What can I do to avoid this?

P.S. C99, GCC 5.3.0, Intel Core i5 Sandy Bridge

200_success
  • 7,286
  • 1
  • 43
  • 74
aprelev
  • 390
  • 2
  • 11
  • 2
    FYI: not directly related to your question, but this site could be useful for you: https://gcc.godbolt.org/ – Jabberwocky Apr 13 '16 at 08:25
  • 2 questions: Have you tried to compile it with -O2 and -O3? And the second question is can it be that `X` is defined as `volatile`? – Alex Lop. Apr 13 '16 at 08:39
  • @AlexLop 1) -O3 produces literally the same assembly; -O2 does not even unroll the loop. 2) No, `X` is not defined as volatile, it is `malloc`'-ed in top level functions and then passed by pointer. – aprelev Apr 13 '16 at 08:46
  • 8
    It would be clearer and more effective (and portable) to simply accumulate the xor values in a temp, then *after* the loop do `X[0] ^= temp;`. The optimizer doesn't necessarily know that the references to `X` aren't aliases to some of the other references within the loop. – Tom Karzes Apr 13 '16 at 08:51
  • gcc doesn't enable `-funroll-loops` even at `-Ofast`. It is enabled with `-fprofile-use`, because then it has enough information to decide which loops to unroll. Modern CPUs like compact code, esp. CPUs that have a uop-cache, so it's only a good idea for really hot loops. Still, unrolling tiny loops a bit is good. clang defaults to unrolling by 4 (at `-O3`). – Peter Cordes Apr 13 '16 at 09:23
  • Out of curiosity, what happens to your code if you replace the `memset` with a plain for loop? Seems this is related to "effective type" and pointer aliasing. – Lundin Apr 13 '16 at 11:54

2 Answers2

44

Redundant stores are usually down to aliasing; in this case gcc would be unable to prove to its satisfaction that the store to X[0] does not affect table. It makes a big difference how the variables are passed to the routine; if they are globals or members of the same larger struct then proving non-aliasing is easier.

Example:

void f1(uint64_t X[2]) {
  memset(X, 0, 16);
  for (int i= 0; i < 16; ++i) {
    X[0] ^= table[i][Y[i]].asQWord;
  }
}

uint64_t X[2];
void f2() {
  memset(X, 0, 16);
  for (int i= 0; i < 16; ++i) {
    X[0] ^= table[i][Y[i]].asQWord;
  }
}

Here the store to X[0] is sunk out of the loop in f2 but not in f1, because only in f2 can gcc prove that X does not alias members of table.

Your workaround/fix could be to adjust how the parameters are passed, to use the restrict specifier, or to manually sink the store yourself.

ecatmur
  • 152,476
  • 27
  • 293
  • 366
  • 3
    terminology: apparently compiler devs call it "sinking" the store out of the loop. It's the same idea as hoisting loop-invariant calculations out of a loop, but in the other direction. Also: Intel-syntax output makes it much easier to see the stores in the godbolt output. The QWORD PTR as the first operand stands out a lot more. gcc fully unrolls in the version you linked so there's a wall-of-code effect, esp. for the AT&T syntax. (I might have used `-fno-unroll-loops` for better human readability.). – Peter Cordes Apr 13 '16 at 09:27
9

To avoid this, you could use this instead:

uint64_t v = 0;
for (int i= 0; i < 16; ++i) {
    v ^= table[i][Y[i]].asQWord;
}
X[0] = v;
X[1] = 0;

You can easily notice the generated instructions are sub-optimal in your case, however for different reasons gcc may not be able to determine that. (And in this case, gcc cannot determine that table will never access the same memory-region as X, as ecatmur explains more elaborately.)

Elijan9
  • 1,269
  • 2
  • 17
  • 18