For the past two years I have been developing a library, cyme, that performs SIMD computation over "friendly containers". I am able to reach the maximum performance of the processor. Typically, the user defines a container and writes a kernel with the following syntax (trivial example):
for(i...)
W[i] = R[i]+R[i]+R[i]+R[i]+R[i];
The expression R[i]+R[i]+ ... is evaluated using SIMD registers. I have precise control over the generated assembly (using expression templates). I am fully satisfied with it; however, I have been exploring the Power architecture for the past few days. The Power7 processor has four floating-point units and one vector unit (from Wikipedia: "The POWER7 processor has an Instruction Sequence Unit that is capable of dispatching up to six instructions per cycle to a set of queues").
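To make the SIMD evaluation concrete, here is a minimal sketch of the same trivial kernel written with the GCC/Clang vector extension. It is not cyme's API: `v2d` and `kernel_simd` are hypothetical names chosen for illustration, and the compiler remains free to pick the actual instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: two doubles packed in one 16-byte value.
typedef double v2d __attribute__((vector_size(16)));

// Hypothetical stand-in for the kernel W[i] = R[i]+R[i]+R[i]+R[i]+R[i].
// Assumption: n is a multiple of 2 (no remainder loop in this sketch).
void kernel_simd(const double* R, double* W, std::size_t n) {
    assert(n % 2 == 0);
    for (std::size_t i = 0; i < n; i += 2) {
        v2d r;
        std::memcpy(&r, R + i, sizeof r);   // load R[i] and R[i+1]
        v2d w = r + r + r + r + r;          // four element-wise vector additions
        std::memcpy(W + i, &w, sizeof w);   // store W[i] and W[i+1]
    }
}
```

Each iteration handles two doubles, which matches the width of the `lxvd2x`/`xvadddp`/`stxvd2x` instructions in the listings.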
My idea was to generate assembly that combines serial (scalar) and vector instructions, so that I might be able to use all five units simultaneously. I did it, and this is where my problem starts.
The first assembly version of the previous code, pure Power SIMD, is:
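As a rough source-level picture of the hybrid idea, the sketch below processes four doubles per iteration: two through a vector value and two through plain scalar arithmetic. The names `v2d` and `kernel_hybrid` are hypothetical, the 2+2 split is an assumption made for illustration, and nothing here forces the compiler to emit scalar `fadd` for the second half (an optimizer may vectorize it as well).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: two doubles packed in one 16-byte value.
typedef double v2d __attribute__((vector_size(16)));

// Hypothetical hybrid kernel: vector path for elements i, i+1,
// scalar path for elements i+2, i+3.
// Assumption: n is a multiple of 4 (no remainder loop in this sketch).
void kernel_hybrid(const double* R, double* W, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        // Vector half: one load, four vector additions, one store.
        v2d r;
        std::memcpy(&r, R + i, sizeof r);
        v2d w = r + r + r + r + r;
        std::memcpy(W + i, &w, sizeof w);

        // Scalar half: two independent chains of four additions each.
        const double a = R[i + 2];
        const double b = R[i + 3];
        W[i + 2] = a + a + a + a + a;
        W[i + 3] = b + b + b + b + b;
    }
}
```

Because each iteration covers twice as many elements as a pure two-wide vector loop, the hybrid loop runs half as many iterations.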
.L536:
lxvd2x 0,0,9
stxvd2x 0,1,31
lxvd2x 12,0,9
stxvd2x 12,1,30
xvadddp 0,0,12
lxvd2x 12,0,9
xvadddp 0,0,12
xvadddp 0,0,12
xvadddp 0,0,12
stxvd2x 0,0,9
addi 9,9,176
cmpld 7,28,9
bne 7,.L536
The "nice" hybrid serial/SIMD version (the loop performs fewer iterations) is:
.L547:
std 31,128(1)
std 31,136(1)
lfd 12,24(9)
stxvd2x 63,1,30
lfd 11,16(9)
fadd 10,12,12
fadd 9,11,11
fadd 10,10,12
fadd 9,9,11
fadd 10,10,12
fadd 9,9,11
lxvd2x 0,0,9
std 31,480(1)
std 31,488(1)
stfd 11,128(1)
stfd 12,136(1)
stxvd2x 63,1,29
stxvd2x 0,1,30
fadd 10,10,12
fadd 9,9,11
stfd 10,24(9)
stfd 9,16(9)
lxvd2x 10,0,9
stfd 11,480(1)
stfd 12,488(1)
stxvd2x 10,1,29
xvadddp 0,0,10
lxvd2x 12,0,9
xvadddp 0,0,12
xvadddp 0,0,12
xvadddp 0,0,12
stxvd2x 0,0,9
addi 9,9,352
cmpld 7,28,9
bne 7,.L547
The benchmark of the first code (one thread; should I use two?) takes 0.2 s, whereas the hybrid version takes 0.25 s. My knowledge of processor architecture is too limited to understand why the hybrid version is slower.
Generating assembly that mixes vector and serial instructions seemed like a charming idea. Does anybody have a suggestion? Is it possible or not?
Best,
++t
PS1: an unrolled SIMD version should be faster; I know, and I have done it, but I am now focusing on this hybrid version.
PS2: gcc 4.9.1, IBM Power7, 8205-E6C