
I am wondering what the optimal order is for a sequence of instructions like the one below, on Intel processors between Core 2 and Westmere. This is AT&T syntax, so the pxor instructions are memory reads and the movdqa instructions are memory writes:

    movdqa  %xmm0, -128+64(%rbx)
    movdqa  %xmm1, -128+80(%rbx)
    movdqa  %xmm2, -128+96(%rbx)
    movdqa  %xmm3, -128+112(%rbx)
    pxor    -128(%rsp), %xmm0
    pxor    -112(%rsp), %xmm1
    pxor    -96(%rsp), %xmm2
    pxor    -80(%rsp), %xmm3
    movdqa  %xmm8, 64(%rbx)
    movdqa  %xmm9, 80(%rbx)
    movdqa  %xmm10, 96(%rbx)
    movdqa  %xmm11, 112(%rbx)
    pxor    -128(%r14), %xmm8
    pxor    -112(%r14), %xmm9
    pxor    -96(%r14), %xmm10
    pxor    -80(%r14), %xmm11
    movdqa  %xmm12, 64(%rdx)
    movdqa  %xmm13, 80(%rdx)
    movdqa  %xmm14, 96(%rdx)
    movdqa  %xmm15, 112(%rdx)
    pxor    0(%r14), %xmm12
    pxor    16(%r14), %xmm13
    pxor    32(%r14), %xmm14
    pxor    48(%r14), %xmm15

%r14, %rsp, %rdx, and %rbx are distinct multiples of 256. In other words, there are no non-obvious aliases in the instructions above, and the data has been laid out for aligned access to large blocks. All the cache lines being accessed are in the L1 cache.

On the one hand, my understanding of Agner Fog's optimization guides makes me believe that it may be possible to get close to two instructions per cycle with an ordering like the one below:

    movdqa  %xmm0, -128+64(%rbx)
    movdqa  %xmm1, -128+80(%rbx)
    pxor    -128(%rsp), %xmm0
    movdqa  %xmm2, -128+96(%rbx)
    pxor    -112(%rsp), %xmm1
    movdqa  %xmm3, -128+112(%rbx)
    pxor    -96(%rsp), %xmm2
    movdqa  %xmm8, 64(%rbx)
    pxor    -80(%rsp), %xmm3
    movdqa  %xmm9, 80(%rbx)
    pxor    -128(%r14), %xmm8
    movdqa  %xmm10, 96(%rbx)
    pxor    -112(%r14), %xmm9
    movdqa  %xmm11, 112(%rbx)
    pxor    -96(%r14), %xmm10
    movdqa  %xmm12, 64(%rdx)
    pxor    -80(%r14), %xmm11
    movdqa  %xmm13, 80(%rdx)
    pxor    0(%r14), %xmm12
    movdqa  %xmm14, 96(%rdx)
    pxor    16(%r14), %xmm13
    movdqa  %xmm15, 112(%rdx)
    pxor    32(%r14), %xmm14
    pxor    48(%r14), %xmm15

This ordering attempts to take into account the “cache bank conflicts” described in Agner Fog's microarchitecture.pdf by leaving an offset between the reads and the writes.

On the other hand, another concern is that although the programmer knows there are no aliases in the code above, they have no way to convey this information to the processor. Might the interleaving of reads and writes introduce delays because the processor has to allow for the possibility that a read targets a value modified by one of the earlier writes? In that case it would obviously be better to do all the reads first, but since that is not possible for this particular sequence of instructions, perhaps getting all the writes done first would make sense.
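For concreteness, the writes-first variant mentioned in the previous paragraph would simply hoist all twelve movdqa stores ahead of the twelve pxor loads; it is purely a reordering of the same instructions:

    movdqa  %xmm0, -128+64(%rbx)
    movdqa  %xmm1, -128+80(%rbx)
    movdqa  %xmm2, -128+96(%rbx)
    movdqa  %xmm3, -128+112(%rbx)
    movdqa  %xmm8, 64(%rbx)
    movdqa  %xmm9, 80(%rbx)
    movdqa  %xmm10, 96(%rbx)
    movdqa  %xmm11, 112(%rbx)
    movdqa  %xmm12, 64(%rdx)
    movdqa  %xmm13, 80(%rdx)
    movdqa  %xmm14, 96(%rdx)
    movdqa  %xmm15, 112(%rdx)
    pxor    -128(%rsp), %xmm0
    pxor    -112(%rsp), %xmm1
    pxor    -96(%rsp), %xmm2
    pxor    -80(%rsp), %xmm3
    pxor    -128(%r14), %xmm8
    pxor    -112(%r14), %xmm9
    pxor    -96(%r14), %xmm10
    pxor    -80(%r14), %xmm11
    pxor    0(%r14), %xmm12
    pxor    16(%r14), %xmm13
    pxor    32(%r14), %xmm14
    pxor    48(%r14), %xmm15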

In short, there seem to be many possibilities here, and my intuition is not good enough to give me a feeling for what is likely to happen with each of them.

EDIT: if that matters, the code that comes before the sequence under consideration has been either loading the xmm registers from memory or computing them with arithmetic instructions, and the code that comes after uses these registers either to write them to memory or as inputs to arithmetic instructions. The memory locations that have been written to are not reused immediately. rbx, rsp, r14 and rdx are long-lived registers that have to come from the register file.

Pascal Cuoq
    This is one of several reasons why coding SIMD with intrinsics is generally a better idea than writing raw asm - let the compiler do the instruction scheduling etc for you - that way you can even re-compile for different architectures and get optimal code for each. – Paul R Mar 07 '15 at 15:29
  • @PaulR I have read enough non-SIMD compiler-generated assembly code to be convinced that picking the worst among the sensible-looking choices would still likely be better than what a compiler would generate. – Pascal Cuoq Mar 07 '15 at 15:37
    Perhaps, but SIMD is a rather different scenario - each intrinsic typically maps to a single instruction so the code generator just has to take care of instruction scheduling, register allocation, peephole optimisation, etc - in general most compilers do a pretty good job of this, so for anything non-trivial I find that intrinsics result in pretty tight code. YMMV of course. – Paul R Mar 07 '15 at 15:51
  • @PaulR For the kind of code I am considering, you have two options: work on several small arrays, which helps the compiler with aliasing but means it either exhausts its registers or generates 9-byte instructions with large offsets, meaning that two will never be executed in one cycle. Or use one large array, meaning that the compiler is powerless with aliasing and will have to use the order of the source code. – Pascal Cuoq Mar 07 '15 at 16:06
  • @PaulR The code I am looking at is an already somewhat tightened version of `scrypt_core_3way` in https://github.com/pooler/cpuminer/blob/master/scrypt-x64.S . If you think a compiler can do better, please show how. Many will be interested. – Pascal Cuoq Mar 07 '15 at 16:09
  • It would certainly be interesting to benchmark compiler-generated code using intrinsics versus hand-coded asm for this. – Paul R Mar 07 '15 at 16:14
  • I doubt you can even measure any performance difference on any out-of-order x86 micro-architecture. In fact, aliasing may be a red herring for this: If all memory accesses are of the same type, size and alignment, load-store forwarding might make even the aliasing case fast. If there is no aliasing, the best chance for finding performance differences between the implementations would be on an in-order machine like an Atom. – EOF Mar 08 '15 at 01:38
  • @EOF This is not intended to run on an Atom, so maximizing performance on the Atom would be a massive case of looking for the keys under the lamppost. One **can** measure the number of cycles taken by this sort of sequence (how do you think Agner Fog wrote his manuals?) and in fact I have been doing so since I posted this question. The sequence I suggested **is** measurably worse than the original. I will post a writeup soon. – Pascal Cuoq Mar 08 '15 at 01:43
  • This looks like code that could benefit from prefetching. e.g., `_mm_prefetch` intrinsic. gcc-4.9.2 (or even gcc-5.x) with `-march=` and intrinsics would be more interesting than simply dismissing compiler code generation altogether. – Brett Hale Mar 09 '15 at 09:49

1 Answer


I instrumented the instructions I was interested in and the surrounding ones as below in order to measure the number of cycles taken by different ordering options in the context in which the instructions are used:

#ifdef M    
    push    %rdx
    push    %rax
    push    %rbx
    push    %rcx    
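    // cpuid serializes execution before rdtsc reads the start timestamp;
    // the two 32-bit halves are saved in a spare 64-bit slot on the stack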
    xorq    %rax, %rax
    cpuid
    rdtsc
    movl    %eax, 256+32+UNUSED_64b
    movl    %edx, 256+32+4+UNUSED_64b
    pop     %rcx        
    pop     %rbx
    pop %rax
    pop %rdx
#endif  
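    // measured sequence: the ordering that turned out to be fastest (see below)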
    movdqa  %xmm0, -128+64(%rbx)
    movdqa  %xmm1, -128+80(%rbx)
    movdqa  %xmm2, -128+96(%rbx)
    movdqa  %xmm3, -128+112(%rbx)

    movdqa  %xmm8, 64(%rbx)
    movdqa  %xmm9, 80(%rbx)
    movdqa  %xmm10, 96(%rbx)
    movdqa  %xmm11, 112(%rbx)

    pxor    -128(%rsp), %xmm0   
    pxor    -112(%rsp), %xmm1
    pxor    -96(%rsp), %xmm2    
    pxor    -80(%rsp), %xmm3

    movdqa  %xmm12, 64(%rdx)
    movdqa  %xmm13, 80(%rdx)
    movdqa  %xmm14, 96(%rdx)
    movdqa  %xmm15, 112(%rdx)

    pxor    -128(%r14), %xmm8   
    pxor    -112(%r14), %xmm9
    pxor    -96(%r14), %xmm10
    pxor    -80(%r14), %xmm11

    movdqa  %xmm0, -128+0(%rbx)
    movdqa  %xmm1, -128+16(%rbx)
    movdqa  %xmm2, -128+32(%rbx)
    movdqa  %xmm3, -128+48(%rbx)

    pxor    0(%r14), %xmm12
    pxor    16(%r14), %xmm13
    pxor    32(%r14), %xmm14
    pxor    48(%r14), %xmm15

    movdqa  %xmm8, 0(%rbx)
    movdqa  %xmm9, 16(%rbx)
    movdqa  %xmm10, 32(%rbx)
    movdqa  %xmm11, 48(%rbx)
    movdqa  %xmm12, 0(%rdx)
    movdqa  %xmm13, 16(%rdx)
    movdqa  %xmm14, 32(%rdx)
    movdqa  %xmm15, 48(%rdx)
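    // end of the measured sequence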

#ifdef M        
    push    %rdx
    push    %rax
    push    %rbx
    push    %rcx    
    xorq    %rax, %rax
    cpuid   
    rdtsc
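    // combine edx:eax into a 64-bit stop timestamp, then replace the saved
    // start value with the elapsed cycle count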
    shlq    $32, %rdx
    orq %rdx, %rax
    subq    256+32+UNUSED_64b, %rax
    movq    %rax, 256+32+UNUSED_64b
    pop     %rcx        
    pop     %rbx    
    pop %rax
    pop %rdx
#endif
…
// safe place
    call do_measure
…
#ifdef M
    .cstring
measure:
        .ascii "%15lu\12\0"

        .section        __DATA,__data
    .align 2

count:
    .word 30000

    .text
do_measure:
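    // count down and skip the printout until the counter reaches zero, so
    // that only an occasional measurement is reported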
    decw    count(%rip)
    jnz     done_measure
    pushq   %rax
    pushq   %rax    
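    // %rax is pushed twice: the extra slot makes the number of pushes odd,
    // which keeps %rsp 16-byte aligned for the call below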
    pushq   %rbx
    pushq   %rcx
    pushq   %rdx
    pushq   %rsi
    pushq   %rdi
    pushq   %rbp    
    pushq   %r9
    pushq   %r10
    pushq   %r11
    pushq   %r12
    pushq   %r13
    pushq   %r14
    pushq   %r15
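        // below, the saved cycle count is reloaded (16*8 accounts for the
        // call's return address and the fifteen pushes above) and passed,
        // with the format string, to _applog; %eax is cleared as required
        // when calling a variadic function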

        movq    16*8+UNUSED_64b, %rsi
        leaq    measure(%rip), %rdi
        xorl    %eax, %eax
        call    _applog

    popq    %r15
    popq    %r14
    popq    %r13    
    popq    %r12
    popq    %r11
    popq    %r10    
    popq    %r9
    popq    %rbp    
    popq    %rdi
    popq    %rsi
    popq    %rdx
    popq    %rcx
    popq    %rbx
    popq    %rax
    popq    %rax    
done_measure:
    ret
#endif

The sequence above is the one I found to be fastest on the processor I am developing on, a Westmere Xeon W3680. The sequence I proposed in the question, for instance, turned out to be terrible, perhaps because it put too much distance between the later instructions that use the xmm registers and the instructions in which those registers were last set, forcing their values to go through the register file as well and causing register read stalls.

UNUSED_64b is the name of an empty slot available on the stack because of alignment constraints. It had to be on the stack because the program uses threads:

#define UNUSED_64b         16(%rsp) 

The 256+32+ offset compensates for extra stack usage at the points where the probes are inserted.

This assembly code is for Mac OS X. Some details would vary on another Unix-like system.

Pascal Cuoq