GCC is giving me a hard time generating optimal assembly for following source code:
memset(X, 0, 16);
for (int i= 0; i < 16; ++i) {
X[0] ^= table[i][Y[i]].asQWord;
}
X
being an uint64_t[2]
array, and
Y
being an unsigned char[16]
array, and
table
being a double dimensional array of union qword_t
:
union qword_t {
uint8_t asBytes[8];
uint64_t asQWord;
};
const union qword_t table[16][256] = /* ... */;
With options -m64 -Ofast -mno-sse
it does unroll the loop, and each xor with assignment results in 3 instructions (thus overall number of instructions issued is 3 * 16 = 48):
movzx r9d, byte ptr [Y + i] ; extracting byte
xor rax, qword ptr [table + r9*8 + SHIFT] ; xoring, SHIFT = i * 0x800
mov qword ptr [X], rax ; storing result
Now, my understanding is that resulting X value could be accumulated in rax
register throughout all 16 xors, and then it could be stored at [X]
address, which could be achieved with these two instructions for each xor with assignment:
movzx r9d, byte ptr [Y + i] ; extracting byte
xor rax, qword ptr [table + r9*8 + SHIFT] ; xoring, SHIFT = i * 0x800
and single storing:
mov qword ptr [X], rax ; storing result
(In this case overall number of instructions is 2 * 16 + 1 = 33)
Why does GCC generate these redundant mov
instructions? What can I do to avoid this?
P.S. C99, GCC 5.3.0, Intel Core i5 Sandy Bridge