
I have an idea for a code style for writing specific kinds of numerical algorithms where you write your algorithm in a purely data-layout-agnostic fashion.

That is, all of your functions take one or more scalar arguments and return, through pointers, one or more scalar values. So, for example, a function that takes a 3D float vector takes float x, float y, float z rather than a struct with three members or a float[3] xyz.

The idea is that you can change the layout of your input and output data, i.e. play with struct-of-arrays vs. array-of-structs layouts, tiled layouts for cache efficiency, SIMD vs. multicore granularity, etc., WITHOUT having to rewrite all of your code for every combination of data layouts.
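For concreteness, here is a minimal sketch of the style I have in mind (hypothetical names): one scalar kernel, plus two thin driver loops for two different data layouts.

// The kernel only sees scalars and knows nothing about data layout.
static inline void scale3(float x, float y, float z, float s,
                          float* ox, float* oy, float* oz) {
    *ox = s * x;
    *oy = s * y;
    *oz = s * z;
}

struct Vec3 { float x, y, z; };

// Array-of-structs driver.
void scale_aos(const struct Vec3* in, float s, struct Vec3* out, int n) {
    for (int i = 0; i < n; ++i)
        scale3(in[i].x, in[i].y, in[i].z, s,
               &out[i].x, &out[i].y, &out[i].z);
}

// Struct-of-arrays driver: same kernel, different layout, no rewrite.
void scale_soa(const float* xs, const float* ys, const float* zs, float s,
               float* ox, float* oy, float* oz, int n) {
    for (int i = 0; i < n; ++i)
        scale3(xs[i], ys[i], zs[i], s, &ox[i], &oy[i], &oz[i]);
}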

The strategy has some obvious downsides:

  • You can't use for loops inside your functions to make your code more compact
  • Your functions need more parameters in their signatures

...but those are palatable if your arrays are short and it saves you having to rewrite your code a bunch of times to make it fast.

But in particular, I am worried that compilers might not be able to take code like x += a; y += b; z += c; w += d; and autovectorize it into a single SIMD vector add, in the case where you want to do SIMD at the bottom of your call stack, as opposed to doing SIMD at the top of a stack of inlined functions.
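Concretely, the pattern I'm worried about looks something like this sketch (hypothetical names): the four lanes only ever meet as separate scalar statements inside an inlined helper, so the compiler would have to pack them back into a vector add itself.

static inline void add4_scalar(float a0, float a1, float a2, float a3,
                               float b0, float b1, float b2, float b3,
                               float* r0, float* r1, float* r2, float* r3) {
    // Four independent scalar adds; turning these into one SIMD add
    // requires the compiler to "re-roll"/pack them (SLP-style vectorization).
    *r0 = a0 + b0;
    *r1 = a1 + b1;
    *r2 = a2 + b2;
    *r3 = a3 + b3;
}

void add_arrays(const float* a, const float* b, float* r, int n) {
    // Assumes n is a multiple of 4 for the sake of the sketch.
    for (int i = 0; i < n; i += 4)
        add4_scalar(a[i], a[i+1], a[i+2], a[i+3],
                    b[i], b[i+1], b[i+2], b[i+3],
                    &r[i], &r[i+1], &r[i+2], &r[i+3]);
}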

Are clang and/or gcc able to "re-roll" manually unrolled loops in C and/or C++ code (probably after functions are inlined) and generate vectorized machine code?

Andrew Wagner

2 Answers

1

I wrote some code to do a trivial test of my idea:

// Compile using gcc -O4 main.c && objdump -d a.out

void add4(float x0, float x1, float x2, float x3, 
          float y0, float y1, float y2, float y3, 
          float* out0, float* out1, float* out2, float* out3) {
  // Non-inlined version of this uses xmm registers and four separate
  // SIMD operations
    *out0 = x0 + y0;
    *out1 = x1 + y1;
    *out2 = x2 + y2;
    *out3 = x3 + y3;
}
void sub4(float x0, float x1, float x2, float x3,
          float y0, float y1, float y2, float y3,
          float* out0, float* out1, float* out2, float* out3) {
    *out0 = x0 - y0;
    *out1 = x1 - y1;
    *out2 = x2 - y2;
    *out3 = x3 - y3;
}
void add4_then_sub4(float x0, float x1, float x2, float x3,
          float y0, float y1, float y2, float y3,
          float z0, float z1, float z2, float z3,
          float* out0, float* out1, float* out2, float* out3) {
    // In the non-inlined version of this, add4 and sub4 get inlined.
    // xmm registers get re-used for the add and subtract,
    // but there is still no 4-way SIMD.
  float temp0,temp1,temp2,temp3;
  // temp= x + y
  add4(x0,x1,x2,x3,
       y0,y1,y2,y3,
       &temp0,&temp1,&temp2,&temp3);
  // out = temp - z
  sub4(temp0,temp1,temp2,temp3,
       z0,z1,z2,z3,
       out0,out1,out2,out3);
}
void add4_then_sub4_arrays(const float x[4],
                                const float y[4],
                                const float z[4],
                                float out[4])
{
    // This is a stand-in for the main function below, but since the arrays are arguments,
    // they can't be optimized out of the non-inlined version of this function.
    // THIS version DOES compile into (I think) a bunch of non-aligned moves,
    // and a single vectorized add and a single vectorized subtract.
    add4_then_sub4(x[0],x[1],x[2],x[3],
            y[0],y[1],y[2],y[3],
            z[0],z[1],z[2],z[3],
            &out[0],&out[1],&out[2],&out[3]
            );
}

int main(int argc, char **argv) 
{
}

Consider the generated assembly for add4_then_sub4_arrays:

0000000000400600 <add4_then_sub4_arrays>:
  400600:       0f 57 c0                xorps  %xmm0,%xmm0
  400603:       0f 57 c9                xorps  %xmm1,%xmm1
  400606:       0f 12 06                movlps (%rsi),%xmm0
  400609:       0f 12 0f                movlps (%rdi),%xmm1
  40060c:       0f 16 46 08             movhps 0x8(%rsi),%xmm0
  400610:       0f 16 4f 08             movhps 0x8(%rdi),%xmm1
  400614:       0f 58 c1                addps  %xmm1,%xmm0
  400617:       0f 57 c9                xorps  %xmm1,%xmm1
  40061a:       0f 12 0a                movlps (%rdx),%xmm1
  40061d:       0f 16 4a 08             movhps 0x8(%rdx),%xmm1
  400621:       0f 5c c1                subps  %xmm1,%xmm0
  400624:       0f 13 01                movlps %xmm0,(%rcx)
  400627:       0f 17 41 08             movhps %xmm0,0x8(%rcx)
  40062b:       c3                      retq   
  40062c:       0f 1f 40 00             nopl   0x0(%rax)

The arrays aren't known to be aligned, so there are a lot more move ops than ideal, and I'm not sure what that xor is doing in there, but there is indeed one 4-way add and one 4-way subtract, as desired.

So the answer is that gcc has at least some ability to pack scalar floating-point operations back into SIMD operations.
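(If you want the compiler to confirm this itself: as far as I know, this packing is done by the SLP / basic-block vectorizer, which gcc enables at -O3 via -ftree-slp-vectorize, and both compilers can print vectorization reports. The exact report flags and pass names may differ between versions, but something along these lines should work:)

gcc -O3 -march=native -fopt-info-vec -c main.c
clang -O3 -march=native -Rpass=slp-vectorizer -c main.c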

Update: Tighter code is generated both with gcc-4.8 -O3 -march=native main.c && objdump -d a.out:

0000000000400600 <add4_then_sub4_arrays>:
  400600:       c5 f8 10 0e             vmovups (%rsi),%xmm1
  400604:       c5 f8 10 07             vmovups (%rdi),%xmm0
  400608:       c5 f0 58 c0             vaddps %xmm0,%xmm1,%xmm0
  40060c:       c5 f8 10 0a             vmovups (%rdx),%xmm1
  400610:       c5 f8 5c c1             vsubps %xmm1,%xmm0,%xmm0
  400614:       c5 f8 11 01             vmovups %xmm0,(%rcx)
  400618:       c3                      retq   
  400619:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

and with clang-4.0 -O3 -march=native main.c && llvm-objdump -d a.out:

add4_then_sub4_arrays:
  4005e0:       c5 f8 10 07                                     vmovups (%rdi), %xmm0
  4005e4:       c5 f8 58 06                                     vaddps  (%rsi), %xmm0, %xmm0
  4005e8:       c5 f8 5c 02                                     vsubps  (%rdx), %xmm0, %xmm0
  4005ec:       c5 f8 11 01                                     vmovups %xmm0, (%rcx)
  4005f0:       c3                                              ret
  4005f1:       66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00    nopw    %cs:(%rax,%rax)
Andrew Wagner
  • `xor` is used to break the dependency that `movlps` would otherwise have on the old value – harold Dec 06 '16 at 15:53
  • xorps + movlps is a braindead alternative to `movsd (%rdx), %xmm1`. And then it's followed by movhps from contiguous bytes? What the hell? What compiler did you use, with what settings? Obviously `movups (%rdx), %xmm1` would be more efficient, especially on any half-way recent CPU. Doing an unaligned load in two halves was a reasonable strategy on some quite old CPUs. – Peter Cordes Dec 07 '16 at 00:21
  • upvoted for testing and showing that this implementation of your idea is not viable with the compiler + options you tested with. 3x to 4x the instruction count for memory source data is ridiculous. (And all those movlps + movhps pairs will bottleneck on the shuffle port, since they're load+blend instructions. See http://agner.org/optimize/ for instruction tables, and the [x86 tag wiki](http://stackoverflow.com/tags/x86/info)). – Peter Cordes Dec 07 '16 at 00:26
  • Thanks Peter! I will go ahead and pick a compiler and figure out the right flags for the architecture – Andrew Wagner Dec 07 '16 at 07:43
-1

Your concern is correct. No compiler is going to autovectorize those 4 adds. It's simply not worth it, considering the inputs aren't contiguous and aligned. The cost of gathering the arguments into a SIMD register is much higher than the savings from a vector addition.

Of course, the reason the compiler can't use an aligned streaming load is that you passed the arguments as scalars.
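For comparison, a sketch of a version that takes contiguous arrays directly (hypothetical name); here the compiler can see the contiguity, and at -O3 it typically emits a single vector load per input, one vector add, and one vector store:

void add4_contiguous(const float* restrict a,
                     const float* restrict b,
                     float* restrict out) {
    // Contiguous, non-aliasing inputs: the fixed-count loop is easy
    // for the compiler to turn into one 4-wide SIMD add.
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}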

MSalters
  • Hi! The idea is that you lay out those scalar arguments linearly outside the function. If the function gets inlined, the compiler has the data layout, and the same definition of what happens to the values, but not expressed with a for loop. I'm working on a more concrete example. – Andrew Wagner Dec 06 '16 at 13:22
  • These days SIMD registers do often get used even when it's just one floating point operation. – Andrew Wagner Dec 06 '16 at 13:58
  • Compilers can and do vectorize a single vector-width operation, if contiguous pointers were passed and the function was inlined so the compiler knew that. Especially clang is good at this. (I can dig up an example if you want). – Peter Cordes Dec 07 '16 at 00:25