I am wondering what the optimal order is for a sequence of instructions like the one below on Intel processors between Core 2 and Westmere. This is AT&T syntax, so the pxor instructions are memory reads and the movdqa instructions are memory writes:
movdqa %xmm0, -128+64(%rbx)
movdqa %xmm1, -128+80(%rbx)
movdqa %xmm2, -128+96(%rbx)
movdqa %xmm3, -128+112(%rbx)
pxor -128(%rsp), %xmm0
pxor -112(%rsp), %xmm1
pxor -96(%rsp), %xmm2
pxor -80(%rsp), %xmm3
movdqa %xmm8, 64(%rbx)
movdqa %xmm9, 80(%rbx)
movdqa %xmm10, 96(%rbx)
movdqa %xmm11, 112(%rbx)
pxor -128(%r14), %xmm8
pxor -112(%r14), %xmm9
pxor -96(%r14), %xmm10
pxor -80(%r14), %xmm11
movdqa %xmm12, 64(%rdx)
movdqa %xmm13, 80(%rdx)
movdqa %xmm14, 96(%rdx)
movdqa %xmm15, 112(%rdx)
pxor 0(%r14), %xmm12
pxor 16(%r14), %xmm13
pxor 32(%r14), %xmm14
pxor 48(%r14), %xmm15
%r14, %rsp, %rdx, and %rbx are distinct multiples of 256. In other words, there are no non-obvious aliases in the instructions above and data has been laid out for aligned access to large blocks of data. All the memory lines being accessed are in the L1 cache.
On the one hand, my understanding of Agner Fog's optimization guides makes me believe that it may be possible to get close to two instructions per cycle with an ordering like the one below:
movdqa %xmm0, -128+64(%rbx)
movdqa %xmm1, -128+80(%rbx)
pxor -128(%rsp), %xmm0
movdqa %xmm2, -128+96(%rbx)
pxor -112(%rsp), %xmm1
movdqa %xmm3, -128+112(%rbx)
pxor -96(%rsp), %xmm2
movdqa %xmm8, 64(%rbx)
pxor -80(%rsp), %xmm3
movdqa %xmm9, 80(%rbx)
pxor -128(%r14), %xmm8
movdqa %xmm10, 96(%rbx)
pxor -112(%r14), %xmm9
movdqa %xmm11, 112(%rbx)
pxor -96(%r14), %xmm10
movdqa %xmm12, 64(%rdx)
pxor -80(%r14), %xmm11
movdqa %xmm13, 80(%rdx)
pxor 0(%r14), %xmm12
movdqa %xmm14, 96(%rdx)
pxor 16(%r14), %xmm13
movdqa %xmm15, 112(%rdx)
pxor 32(%r14), %xmm14
pxor 48(%r14), %xmm15
This ordering attempts to take into account “cache bank conflicts” as described in Agner Fog's microarchitecture.pdf by leaving an offset between the reads and the writes.
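To make the offset argument concrete, here is a minimal Python sketch that checks the interleaved ordering for bank conflicts. It assumes, following my reading of Agner Fog's description for Core 2, that each 64-byte line is split into four 16-byte banks selected by address bits 4 and 5, and that only a store and the load immediately after it can issue in the same cycle. Both are assumptions, not measured facts. Since all base registers are multiples of 256, the bank depends only on the displacement. The name `bank` and the list of pairs are mine, for illustration only.

```python
def bank(displacement):
    """Bank index of a 16-byte access, ASSUMING address bits 4-5 select the bank.

    The base registers are multiples of 256, so they contribute nothing to
    bits 4-5 and only the displacement matters.
    """
    return (displacement >> 4) & 3


# (store displacement, displacement of the load right after it),
# read off the interleaved ordering above.
adjacent_pairs = [
    (-128 + 80, -128), (-128 + 96, -112), (-128 + 112, -96), (64, -80),
    (80, -128), (96, -112), (112, -96), (64, -80),
    (80, 0), (96, 16), (112, 32),
]

# A pair conflicts (under the assumed model) when both accesses hit the same bank.
conflicts = [(s, l) for s, l in adjacent_pairs if bank(s) == bank(l)]

for s, l in adjacent_pairs:
    print(f"store {s:5d} -> bank {bank(s)}   load {l:5d} -> bank {bank(l)}")
print("conflicting pairs:", conflicts)
```

Under these assumptions the list of conflicting pairs comes out empty: each load lands one bank behind the store just before it. Whether those are really the pairs that execute together is exactly what I cannot tell.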
On the other hand, another concern is that although the programmer knows that there are no aliases in the code above, they have no way to convey this information to the processor. Could interleaving reads and writes introduce delays because the processor has to allow for the possibility that a read targets a location modified by an earlier write in the sequence? In that case it would obviously be better to do all the reads first, but since this is not possible for that particular sequence of instructions, perhaps getting all the writes done first would make sense.
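One part of the aliasing concern can be checked with arithmetic alone. The sketch below assumes that any early alias check in the processor compares some number of low address bits that includes at least the low 8 (the commonly cited figure is the low 12 bits, which I am treating as an assumption here). Because every base register is a multiple of 256, the low 8 bits of each address are determined by the displacement, so it is enough to compare displacements modulo 256. The names `STORE_DISPLACEMENTS`, `LOAD_DISPLACEMENTS` and `low_bits` are mine.

```python
# Displacements used by the movdqa stores (off %rbx and %rdx) in the sequence.
STORE_DISPLACEMENTS = [-128 + 64, -128 + 80, -128 + 96, -128 + 112, 64, 80, 96, 112]
# Displacements used by the pxor loads (off %rsp and %r14) in the sequence.
LOAD_DISPLACEMENTS = [-128, -112, -96, -80, 0, 16, 32, 48]


def low_bits(displacement, bits=8):
    """Low `bits` bits of base + displacement, given the base is a multiple of 2**bits."""
    return displacement & ((1 << bits) - 1)


store_low = {low_bits(d) for d in STORE_DISPLACEMENTS}
load_low = {low_bits(d) for d in LOAD_DISPLACEMENTS}

# Any value in both sets would allow a load and a store to match in their
# low 8 bits for some choice of bases; an empty set rules that out.
shared = store_low & load_low

print("store low bytes:", sorted(store_low))
print("load low bytes: ", sorted(load_low))
print("shared:", sorted(shared))
```

The two sets turn out to be disjoint, so with this layout no load can match any store in its low 8 address bits, whatever the actual base values are. This says nothing about delays from other mechanisms, which is the part I am unsure about.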
In short, there seem to be many possibilities here, and my intuition is not good enough to tell what is likely to happen with each of them.
EDIT: if that matters, the code that comes before the sequence under consideration has been either loading the xmm registers from memory or computing them with arithmetic instructions, and the code that comes after uses these registers either to write them to memory or as inputs to arithmetic instructions. The memory locations that have been written to are not reused immediately. rbx, rsp, r14 and rdx are long-lived registers that have to come from the register file.