This is a follow-up to this question about getting GCC to optimize memcpy()
in a loop; I've given up on that and decided to go the direct route of optimizing the loop manually.
I'm trying to stay as portable and maintainable as possible, though, so I'd like to get GCC to vectorize a simple optimized repeated copy-within-a-loop by itself, without resorting to SSE intrinsics. However, it seems to refuse to do so regardless of how much handholding I give it, despite the fact that the manually vectorized version (using SSE2 MOVDQA
instructions) is empirically up to 58% faster for small arrays (< 32 elements) and at least 17% faster for larger ones (>= 512).
Here's the version that isn't manually vectorized (with as many hints as I could think of to tell GCC to vectorize it):
#include <assert.h>

__attribute__ ((noinline))
void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, l;
    double * __restrict__ subout __attribute__ ((aligned (16)));
    double * __restrict__ subin  __attribute__ ((aligned (16)));

    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);

    l = k - (k % 8);
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        subout = &out[i * stride_out_0];
        subin = &in[idx * stride_in_0];
        for(j = 0; j < l; j += 8) {
            subout[j+0] = subin[j+0];
            subout[j+1] = subin[j+1];
            subout[j+2] = subin[j+2];
            subout[j+3] = subin[j+3];
            subout[j+4] = subin[j+4];
            subout[j+5] = subin[j+5];
            subout[j+6] = subin[j+6];
            subout[j+7] = subin[j+7];
        }
        for( ; j < k; ++j)
            subout[j] = subin[j];
    }
}
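One detail I noticed while experimenting: the aligned(16) attribute above applies to the pointer variables themselves, not to the data they point to, so the vectorizer still has to treat the loads and stores as potentially unaligned. A sketch of a variant (the name take_hinted is mine) that instead uses __builtin_assume_aligned, which GCC 4.7 introduced for exactly this purpose; it is only valid if every row really does start on a 16-byte boundary:

```c
#include <assert.h>

/* Sketch: same copy as take(), but telling GCC that the *pointed-to*
   rows are 16-byte aligned via __builtin_assume_aligned (GCC >= 4.7).
   Only valid if the base pointers and strides guarantee that. */
__attribute__ ((noinline))
void take_hinted(double * out, double * in,
                 int stride_out_0, int stride_out_1,
                 int stride_in_0, int stride_in_1,
                 int * indexer, int n, int k)
{
    int i, j;
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);

    for(i = 0; i < n; ++i) {
        /* promise GCC that each row starts on a 16-byte boundary */
        double * __restrict__ subout =
            __builtin_assume_aligned(&out[i * stride_out_0], 16);
        const double * __restrict__ subin =
            __builtin_assume_aligned(&in[indexer[i] * stride_in_0], 16);
        for(j = 0; j < k; ++j)
            subout[j] = subin[j];
    }
}
```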
And here's my first attempt at manual vectorization, which I used for comparing performance (it could definitely be improved further, but I just wanted to test the most naive transformation possible):
#include <assert.h>
#include <emmintrin.h>

__attribute__ ((noinline))
void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, l;
    __m128i * __restrict__ subout1 __attribute__ ((aligned (16)));
    __m128i * __restrict__ subin1  __attribute__ ((aligned (16)));
    double  * __restrict__ subout2 __attribute__ ((aligned (16)));
    double  * __restrict__ subin2  __attribute__ ((aligned (16)));

    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);

    l = (k - (k % 8)) / 2;
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        subout1 = (__m128i*)&out[i * stride_out_0];
        subin1 = (__m128i*)&in[idx * stride_in_0];
        for(j = 0; j < l; j += 4) {
            subout1[j+0] = subin1[j+0];
            subout1[j+1] = subin1[j+1];
            subout1[j+2] = subin1[j+2];
            subout1[j+3] = subin1[j+3];
        }
        j *= 2;
        subout2 = &out[i * stride_out_0];
        subin2 = &in[idx * stride_in_0];
        for( ; j < k; ++j)
            subout2[j] = subin2[j];
    }
}
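As an aside, the same manual copy can also be written with the SSE2 double-precision intrinsics instead of assigning through raw __m128i pointers. This sketch (the name take_sse2 is mine) is an equivalent formulation rather than an improvement, but it avoids punning doubles through an integer vector type; like the version above, it requires each row to be 16-byte aligned:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_pd / _mm_store_pd */

/* Sketch of the manual copy using double-precision SSE2 intrinsics.
   Both rows must start on a 16-byte boundary for the aligned
   load/store to be valid. */
__attribute__ ((noinline))
void take_sse2(double * out, double * in,
               int stride_out_0, int stride_out_1,
               int stride_in_0, int stride_in_1,
               int * indexer, int n, int k)
{
    int i, j, l;
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);

    l = k - (k % 2);  /* vectorized part: pairs of doubles */
    for(i = 0; i < n; ++i) {
        double * subout = &out[i * stride_out_0];
        double * subin  = &in[indexer[i] * stride_in_0];
        for(j = 0; j < l; j += 2)
            /* 16-byte aligned load + store, i.e. MOVAPD/MOVDQA */
            _mm_store_pd(&subout[j], _mm_load_pd(&subin[j]));
        for( ; j < k; ++j)  /* scalar tail for odd k */
            subout[j] = subin[j];
    }
}
```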
(The actual code is only slightly more complicated, to handle some special cases, but not in a way that affects GCC's vectorization, since even the stripped-down versions given above don't vectorize; my test harness can be found on LiveWorkspace.)
I'm compiling the first version with the following command line:
gcc-4.7 -O3 -ftree-vectorizer-verbose=3 -march=pentium4m -fverbose-asm \
-msse -msse2 -msse3 take.c -DTAKE5 -S -o take5.s
The resulting instructions used for the main copy loop are always FLDL
/FSTPL
pairs (i.e., copying in 8-byte units) rather than the MOVDQA
instructions I get when I use the SSE intrinsics manually.
The relevant output from -ftree-vectorizer-verbose
seems to be:
Analyzing loop at take.c:168
168: vect_model_store_cost: unaligned supported by hardware.
168: vect_model_store_cost: inside_cost = 8, outside_cost = 0 .
168: vect_model_load_cost: unaligned supported by hardware.
168: vect_model_load_cost: inside_cost = 8, outside_cost = 0 .
168: cost model: Adding cost of checks for loop versioning aliasing.
168: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
168: cost model: the vector iteration cost = 16 divided by the scalar iteration cost = 16 is greater or equal to the vectorization factor = 1.
168: not vectorized: vectorization not profitable.
I'm not sure why it's referring to "unaligned" stores and loads, and in any case the problem seems to be that the vectorization can't be proven profitable (even though, empirically, it is in all the cases that matter, and I'm not sure in which cases it wouldn't be).
Is there any simple flag or hint that I'm missing here, or does GCC just not want to do this no matter what?
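For completeness, the only cost-model switch I'm aware of is the one below; I'm not certain GCC 4.7 accepts it (it may only be recognized by other releases), but if it does, it should take the profitability judgment out of the equation entirely:

```shell
gcc-4.7 -O3 -ftree-vectorizer-verbose=3 -march=pentium4m -fverbose-asm \
    -msse -msse2 -msse3 -fno-vect-cost-model take.c -DTAKE5 -S -o take5.s
```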
I'm going to be embarrassed if this is something obvious, but hopefully this can help someone else, too, if it is.