
I was trying to do an array operation that looked something like this:

for (int i = 0; i < 32; i++)
{
    output[offset+i] += input[i];
}

where output and input are float arrays (which are 16-byte aligned thanks to malloc). However, I can't guarantee that offset % 4 == 0. I was wondering how to fix these alignment problems.

I thought of something like

while ((offset + c) % 4 != 0)
{
    output[offset+c] += input[c];
    c++;
}

followed by an aligned loop. Obviously this can't work as is, because we would then need unaligned access to input.

Is there a way to vectorize my original loop?

John Palmer
  • Have you tried `_mm_loadu_ps()` or `_mm_storeu_ps()`? They will let you do misaligned memory accesses. – Mysticial Apr 24 '12 at 03:00
  • @Mysticial - haven't tried those functions - are there any docs? - google doesn't show anything obvious and the gcc docs don't appear to have anything either – John Palmer Apr 24 '12 at 03:06
  • http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse_load.htm and http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_fp_store.htm – Mysticial Apr 24 '12 at 03:07
  • Do be aware that they will still incur a performance hit if the addresses are indeed misaligned. But that's unavoidable without resorting to very messy alignment hacks. – Mysticial Apr 24 '12 at 03:10
  • Are you responsible for allocating the input and output arrays? If you are, and you know the value of `offset` at allocation time, you could allocate some extra space and offset the output buffer pointer appropriately. Not the cleanest solution, but performant, and not too difficult. – Jason R Apr 24 '12 at 03:23
  • @JasonR The problem is that I call the function with lots of different `offset` values. I had actually considered having 4 different output arrays and then adding them together. I am going to test this later to see if it is any faster – John Palmer Apr 24 '12 at 03:34

1 Answer


Moving comments to an answer:

There are SSE instructions for misaligned memory accesses. They are accessible via the following intrinsics:

  • _mm_loadu_ps()
  • _mm_storeu_ps()

and similarly for all the double and integer types.
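As a rough illustration of how these intrinsics fit together, here is a minimal sketch for an arbitrary length. The helper name `add_floats_unaligned` and its signature are made up for this example; only the `_mm_*` intrinsics come from `<xmmintrin.h>`. It assumes an x86 target with SSE enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps */

/* Hypothetical helper (not from the original post):
   dst[i] += src[i] for i in [0, n).
   Neither pointer needs to be 16-byte aligned. */
static void add_floats_unaligned(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t i = 0;
    /* Process four floats per iteration using unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(dst + i);
        __m128 b = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(a, b));
    }
    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        dst[i] += src[i];
}
```

With the arrays from the question, the call would look like `add_floats_unaligned(output + offset, input, 32)`.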

So if you can't guarantee alignment, then this is the easy way to go. If possible, the ideal solution is to align your arrays from the start so that you avoid this problem altogether.
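For the "align from the start" route, a small sketch may help. Note that the alignment `malloc` provides is implementation-dependent, so this uses C11 `aligned_alloc` to request 16 bytes explicitly. Both helper names below are hypothetical, and a 16-byte-aligned base only helps when the element offset is also a multiple of 4 floats, which the second helper lets you check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: true if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: allocate room for n floats on a 16-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 16. */
static float *alloc_floats_aligned16(size_t n)
{
    size_t bytes = (n * sizeof(float) + 15) & ~(size_t)15;
    if (bytes == 0)
        bytes = 16;
    float *p = aligned_alloc(16, bytes);
    assert(p == NULL || is_aligned16(p));
    return p;  /* release with free() */
}
```

Given a buffer from `alloc_floats_aligned16`, `is_aligned16(&output[offset])` holds exactly when `offset % 4 == 0`, which is the condition the question cannot guarantee.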

There will still be a performance penalty for misaligned accesses, but they're unavoidable unless you resort to extremely messy shift/shuffle hacks (such as _mm_alignr_epi8()).

Here is the code using _mm_loadu_ps and _mm_storeu_ps. It is actually 50% slower than what gcc generates by itself:

for (int j = 0; j < 8; j++)
{
    float* out = &output[offset + j*4];             // may be misaligned
    __m128 in  = _mm_load_ps(&input[j*4]);          // input is aligned, so no need for _mm_loadu_ps
    __m128 res = _mm_add_ps(in, _mm_loadu_ps(out)); // add values
    _mm_storeu_ps(out, res);                        // store result
}
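A possible variant picks up the peeling idea from the question: run a scalar prologue until the output pointer reaches a 16-byte boundary, then use aligned access for the output and an unaligned load only for the input. This is a sketch under assumptions, not a measured improvement. The name `add_floats_peeled` is made up for this example, and whether it beats the compiler's own vectorization would need to be benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper (not from the original post):
   dst[i] += src[i] for i in [0, n), with aligned access on the dst side. */
static void add_floats_peeled(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t i = 0;
    /* Scalar prologue: at most 3 iterations for a float-aligned dst,
       stopping once dst + i is on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(dst + i) % 16) != 0) {
        dst[i] += src[i];
        i++;
    }
    /* Main loop: dst + i is now 16-byte aligned; src + i may not be. */
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_load_ps(dst + i);   /* aligned load */
        __m128 b = _mm_loadu_ps(src + i);  /* unaligned load */
        _mm_store_ps(dst + i, _mm_add_ps(a, b));  /* aligned store */
    }
    /* Scalar epilogue for whatever is left. */
    for (; i < n; i++)
        dst[i] += src[i];
}
```

This trades two unaligned accesses per iteration (load and store on the output) for one unaligned load on the input, at the cost of the extra prologue and epilogue.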
John Palmer
Mysticial