How can I get GCC to vectorize this simple copy loop with SSE instructions?

Question

This is a follow up to this question about getting GCC to optimize memcpy() in a loop; I've given up and decided to go the direct route of optimizing the loop manually.

I'm trying to stay as portable and maintainable as possible, though, so I'd like to get GCC to vectorize a simple optimized repeated copy-within-a-loop itself without resorting to SSE intrinsics. However, it seems to refuse doing so regardless of how much handholding I give it, despite the fact that the manually vectorized version (with the SSE2 MOVDQA instructions) is empirically up to 58% faster for small arrays (<32 elements) and at least 17% faster for larger ones (>=512).

Here's the version that isn't manually vectorized (with as many hints as I could think of to tell GCC to vectorize it):

__attribute__ ((noinline))
void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, l;
    double * __restrict__ subout __attribute__ ((aligned (16)));
    double * __restrict__ subin __attribute__ ((aligned (16)));
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    l = k - (k % 8);
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        subout = &out[i * stride_out_0];
        subin = &in[idx * stride_in_0];
        for(j = 0; j < l; j += 8) {
            subout[j+0] = subin[j+0];
            subout[j+1] = subin[j+1];
            subout[j+2] = subin[j+2];
            subout[j+3] = subin[j+3];
            subout[j+4] = subin[j+4];
            subout[j+5] = subin[j+5];
            subout[j+6] = subin[j+6];
            subout[j+7] = subin[j+7];
        }
        for( ; j < k; ++j)
            subout[j] = subin[j];
    }
}

And here's my first attempt at manual vectorization, which I used for comparing performance (it could definitely be improved further, but I just wanted to test the most naive transformation possible):

__attribute__ ((noinline))
void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, l;
    __m128i * __restrict__ subout1 __attribute__ ((aligned (16)));
    __m128i * __restrict__ subin1 __attribute__ ((aligned (16)));
    double * __restrict__ subout2 __attribute__ ((aligned (16)));
    double * __restrict__ subin2 __attribute__ ((aligned (16)));
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    l = (k - (k % 8)) / 2;
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        subout1 = (__m128i*)&out[i * stride_out_0];
        subin1 = (__m128i*)&in[idx * stride_in_0];
        for(j = 0; j < l; j += 4) {
            subout1[j+0] = subin1[j+0];
            subout1[j+1] = subin1[j+1];
            subout1[j+2] = subin1[j+2];
            subout1[j+3] = subin1[j+3];
        }
        j *= 2;
        subout2 = &out[i * stride_out_0];
        subin2 = &in[idx * stride_in_0];
        for( ; j < k; ++j)
            subout2[j] = subin2[j];
    }
}

(The actual code is only slightly more complicated to handle some special cases, but not in the way that affects GCC vectorization, since even the versions given above don't vectorize either: my test harness can be found on LiveWorkspace)

I'm compiling the first version with the following command line:

gcc-4.7 -O3 -ftree-vectorizer-verbose=3 -march=pentium4m -fverbose-asm \
    -msse -msse2 -msse3 take.c -DTAKE5 -S -o take5.s

And the resulting instructions used for the main copy loop are always FLDL/FSTPL pairs (i.e. copying in 8 byte units) rather than MOVDQA instructions, which result when I use SSE intrinsics manually.

The relevant output from tree-vectorize-verbose seems to be:

Analyzing loop at take.c:168

168: vect_model_store_cost: unaligned supported by hardware.
168: vect_model_store_cost: inside_cost = 8, outside_cost = 0 .
168: vect_model_load_cost: unaligned supported by hardware.
168: vect_model_load_cost: inside_cost = 8, outside_cost = 0 .
168: cost model: Adding cost of checks for loop versioning aliasing.

168: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
168: cost model: the vector iteration cost = 16 divided by the scalar iteration cost = 16 is greater or equal to the vectorization factor = 1.
168: not vectorized: vectorization not profitable.

I'm not sure why it's referring to "unaligned" stores and loads, and in any case the problem seem to be that the vectorization can't be proven to be profitable (even though empirically, it is for all cases that matter, and I'm not sure what cases where it wouldn't be).

Is there any simple flag or hint that I'm missing here, or does GCC just not want to do this no matter what?

_{I'm going to be embarrassed if this is something obvious, but hopefully this can help someone else, too, if it is.}

All those `__attribute__ ((aligned (16)))` directives are doing very little as they just define the alignment of the pointer variable itself, not the data that the pointer points to. — Paul R, Mar 25 '13 at 18:24
@PaulR oh, really? that might be it then. How do I tell it that the data is aligned? I can't statically allocate the arrays aligned in the actual code, they're buffers being passed in from elsewhere. — Stephen Lin, Mar 25 '13 at 18:25
Take a look at `__builtin_assume_aligned`: http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html — Paul R, Mar 25 '13 at 18:29
@PaulR thanks, that works...do you want a post an answer so I can accept it? guess it was obvious after all :) — Stephen Lin, Mar 25 '13 at 18:30

score 4 · Accepted Answer · answered Mar 25 '13 at 18:31

4

All the __attribute__ ((aligned (16))) directives are achieving very little as they just define the alignment of the pointer variable itself, not the data that the pointer points to.

You probably need to look at __builtiin_assume_aligned.

answered Mar 25 '13 at 18:31

Paul R

208,748
37
389
560

1

+1, this works, thanks for the help and correcting this newbie misunderstanding – Stephen Lin Mar 25 '13 at 18:33
Actually, it works in that there's SSE instructions in the output and GCC claims the loop is vectorized, but my manual vectorization is till doing much better...I guess it's a different question though...have to read through the assembly... – Stephen Lin Mar 25 '13 at 18:35
Auto-vectorization is rarely as good as vectorization by hand, even when you supply all possible compiler hints. You might want to try a 30 day evaluation license of ICC though - Intel has been working on auto-vectorization a lot longer than the gcc contributors. – Paul R Mar 25 '13 at 18:59
Well, I would do that if this was going to be distributed pre-built, but it will be compiled on the user's machine actually, in general. Anyway, I can probably figure this out myself, but it seems like GCC is still creating two different versions of the loop, one with SSE2 and one without, and using some condition to check which one to use, so I'm not even sure it's using the SSE2 version (the performance is more or less the same as it was before, and nowhere close to the manual version)...do you know if that's expected? – Stephen Lin Mar 25 '13 at 19:02
I basically changed `subout = &out[i * stride_out_0];` to `subout = __builtin_assume_aligned(&out[i * stride_out_0], 16)`; and `subin = &in[idx * stride_in_0];` to `subin = __builtin_assume_aligned(&in[idx * stride_in_0], 16);`, within the inner loop. – Stephen Lin Mar 25 '13 at 19:03

How can I get GCC to vectorize this simple copy loop with SSE instructions?

1 Answers1