hybrid assembly scalar/vector on Power7 architecture

Question

For two years I have been developing a library, cyme, that performs SIMD computation over "friendly containers". I am able to reach the peak performance of the processor. Typically, the user defines a container and writes a kernel with the following syntax (trivial example):

 for(i...)
 W[i] = R[i]+R[i]+R[i]+R[i]+R[i];

R[i]+R[i]+ ... performs the operations using SIMD registers. I have precise control over the generated asm (using expression templates). I am fully satisfied with it; however, I have been exploring the Power architecture for a few days. The Power7 processor has 4 floating-point units and one vector unit (from Wikipedia: "The POWER7 processor has an Instruction Sequence Unit that is capable of dispatching up to six instructions per cycle to a set of queues").

My idea was to generate asm that combines scalar and vector instructions, so that I might be able to use the 5 units simultaneously. I did it, and this is where my problem starts:
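To make the idea concrete, here is a minimal C++ sketch of such a hybrid split. It is not cyme's code: `hybrid_sum5` and `v2d` are hypothetical names, and the GCC/Clang `vector_size` extension stands in for the VSX registers. Each block of four doubles is handled half through a 2-wide vector and half through plain scalars; whether the compiler really emits `xvadddp` for one half and `fadd` for the other is an assumption that has to be checked in the generated asm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: two doubles in one 16-byte value.
typedef double v2d __attribute__((vector_size(16)));

// Hypothetical helper: W[i] = R[i]+R[i]+R[i]+R[i]+R[i].
// Per block of 4 doubles: elements 0-1 via the vector type,
// elements 2-3 via two independent scalar chains.
void hybrid_sum5(const double* R, double* W, std::size_t n) {
    assert(R != nullptr && W != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Vector half (memcpy keeps the load/store alignment-safe).
        v2d v;
        std::memcpy(&v, R + i, sizeof v);
        v2d acc = v + v;
        acc = acc + v;
        acc = acc + v;
        acc = acc + v;

        // Scalar half: two chains that do not depend on each other.
        const double a = R[i + 2];
        const double b = R[i + 3];
        double sa = a + a;
        double sb = b + b;
        sa += a;  sb += b;
        sa += a;  sb += b;
        sa += a;  sb += b;

        std::memcpy(W + i, &acc, sizeof acc);
        W[i + 2] = sa;
        W[i + 3] = sb;
    }
    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        const double r = R[i];
        W[i] = r + r + r + r + r;
    }
}
```

Calling `hybrid_sum5(R, W, n)` on two arrays of length `n` fills `W` with five times each element of `R`, whatever instruction mix the compiler picks.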

The first asm version of the previous code, pure SIMD on Power, is:

.L536:
    lxvd2x 0,0,9
    stxvd2x 0,1,31
    lxvd2x 12,0,9
    stxvd2x 12,1,30
    xvadddp 0,0,12
    lxvd2x 12,0,9
    xvadddp 0,0,12
    xvadddp 0,0,12
    xvadddp 0,0,12
    stxvd2x 0,0,9
    addi 9,9,176
    cmpld 7,28,9
    bne 7,.L536

The "nice" hybrid scalar/SIMD version (the loop does fewer iterations) is:

.L547:
    std 31,128(1)
    std 31,136(1)
    lfd 12,24(9)
    stxvd2x 63,1,30
    lfd 11,16(9)
    fadd 10,12,12
    fadd 9,11,11
    fadd 10,10,12
    fadd 9,9,11
    fadd 10,10,12
    fadd 9,9,11
    lxvd2x 0,0,9
    std 31,480(1)
    std 31,488(1)
    stfd 11,128(1)
    stfd 12,136(1)
    stxvd2x 63,1,29
    stxvd2x 0,1,30
    fadd 10,10,12
    fadd 9,9,11
    stfd 10,24(9)
    stfd 9,16(9)
    lxvd2x 10,0,9
    stfd 11,480(1)
    stfd 12,488(1)
    stxvd2x 10,1,29
    xvadddp 0,0,10
    lxvd2x 12,0,9
    xvadddp 0,0,12
    xvadddp 0,0,12
    xvadddp 0,0,12
    stxvd2x 0,0,9
    addi 9,9,352
    cmpld 7,28,9
    bne 7,.L547

The benchmark (one thread, but should I use two?) takes 0.2 s for the first code, whereas the hybrid version takes 0.25 s. My knowledge of processor architecture is too limited to understand why the hybrid version is slower.
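For reference, a comparison like this can be timed with a small harness along the following lines. This is only a sketch, not the harness that produced the numbers above: `best_seconds` is a hypothetical name, and it reports the best of several runs to reduce noise from the system.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: run `kernel` `repeats` times and return the
// shortest wall-clock time of a single run, in seconds.
template <class F>
double best_seconds(F&& kernel, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        kernel();
        const auto t1 = std::chrono::steady_clock::now();
        const std::chrono::duration<double> dt = t1 - t0;
        if (dt.count() < best) best = dt.count();
    }
    return best;
}
```

The kernel should write to memory that is read afterwards, otherwise the compiler may remove the work being measured.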

Generating assembly that mixes vector and scalar instructions seemed like a charming idea, so if anybody has a suggestion: is it possible or not?

Best,

++t

ps1: an unrolled SIMD version should be faster; I know, and I did it, but I am now focusing on this hybrid version.

ps2: gcc 4.9.1, Power7-IBM,8205-E6C

*"... My knowledge on processors architecture is too limited to understand why the hybrid version is slower."* - I've noticed two things that may affect the performance. First is the loop unrolling. I've seen SHA drop in speed when the loops were unrolled (all other code is the same). Second, the interleaving of integer operations with Altivec/SIMD instructions. Avoid interleaving them when you are doing SIMD operations (the problem got worse on Power9; things slowed down even more). — jww, Oct 31 '18 at 18:57

score 2 · Answer 1 · answered Feb 25 '15 at 22:31

I don't have any hands-on experience with these, but according to this PDF, it sounds like the POWER7 merged the previously separate scalar and vector floating-point units to save die space. If that is accurate, interleaving scalar and vector instructions won't achieve any parallelism beyond what the vectorized instructions already provide.

From the abstract:

Unlike previous PowerPC designs, the POWER7 FPU merges the scalar and vector FPUs into a single unit executing three floating-point instruction sets
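If that reading is right, any extra instruction-level parallelism would have to come from independent vector operations rather than from scalar ones. A minimal sketch of that alternative, assuming the GCC/Clang `vector_size` extension as a stand-in for VSX (`vector_sum5_unrolled` and `v2d` are hypothetical names, and I have not measured this on a POWER7):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: two doubles in one 16-byte value.
typedef double v2d __attribute__((vector_size(16)));

// Hypothetical helper: W[i] = R[i]+R[i]+R[i]+R[i]+R[i], unrolled so that
// two independent vector add chains are in flight per iteration.
void vector_sum5_unrolled(const double* R, double* W, std::size_t n) {
    assert(R != nullptr && W != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v2d a, b;
        std::memcpy(&a, R + i, sizeof a);      // elements i, i+1
        std::memcpy(&b, R + i + 2, sizeof b);  // elements i+2, i+3

        // The chain on `sa` never reads `sb` and vice versa, so the
        // hardware is free to overlap them.
        v2d sa = a + a;
        v2d sb = b + b;
        sa = sa + a;  sb = sb + b;
        sa = sa + a;  sb = sb + b;
        sa = sa + a;  sb = sb + b;

        std::memcpy(W + i, &sa, sizeof sa);
        std::memcpy(W + i + 2, &sb, sizeof sb);
    }
    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        const double r = R[i];
        W[i] = r + r + r + r + r;
    }
}
```

This keeps all the arithmetic in vector registers and avoids moving data between scalar and vector code inside the loop.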

Do you have access to a POWER6 to test your interleaved code? I would be interested to see how that goes.

Power6 no, but Power8 in a few days ... thank you for the paper. — Timocafé, Feb 26 '15 at 07:25
