I want to multiply two (float/double) vectors using AVX intrinsics. To do that, I need aligned memory. My code for float values is:

#define SIZE 65536
float *g, *h, *j;
g = (float*)aligned_alloc(32, sizeof(float)*SIZE);
h = (float*)aligned_alloc(32, sizeof(float)*SIZE);
j = (float*)aligned_alloc(32, sizeof(float)*SIZE);
//Filling g and h with data
for(int i = 0; i < SIZE/8; i++)
{
    __m256 a_a, b_a, c_a;
    a_a = _mm256_load_ps(g+8*i);
    b_a = _mm256_load_ps(h+8*i);
    c_a = _mm256_mul_ps(a_a, b_a);
    _mm256_store_ps(j+i*8, c_a);
}
free(g);
free(h);
free(j);

That works, but when I try to do the same with double values, I get a memory access error (as if the memory were not aligned correctly):

double *g_d, *h_d, *i_d;
g_d = (double*)aligned_alloc(32, sizeof(double)*SIZE);
h_d = (double*)aligned_alloc(32, sizeof(double)*SIZE);
i_d = (double*)aligned_alloc(32, sizeof(double)*SIZE);
for(int i = 0; i < SIZE/4; i++)
{
    __m256d a_a, b_a, c_a;
    a_a = _mm256_load_pd(g_d+4*i);
    b_a = _mm256_load_pd(h_d+4*i);
    c_a = _mm256_mul_pd(a_a, b_a);
    _mm256_store_pd (i_d+i*4, c_a);
}
free(g_d);
free(h_d);
free(i_d);

Why is the alignment not working for the double values?

When running it in gdb, I get

Program received signal SIGSEGV, Segmentation fault.
0x0000000000401669 in _mm256_load_pd (__P=0x619f70) at /usr/lib/gcc/x86_64-linux-gnu/5/include/avxintrin.h:836
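
As a quick sanity check on the address gdb reports, the alignment can be tested directly. The following is a minimal sketch, not part of the original program; the helper name `is_aligned` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original code):
 * returns 1 if addr is a multiple of alignment, 0 otherwise. */
static int is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0);   /* a zero alignment makes no sense */
    return addr % alignment == 0;
}
```

Applied to the pointer in the trace, `is_aligned(0x619f70, 16)` is true but `is_aligned(0x619f70, 32)` is false, so the address passed to `_mm256_load_pd` is only 16-byte aligned, while the aligned load requires 32 bytes.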

Edit: I found my mistake. It was a copy/paste error from an earlier function that only showed up in this one. Since I assume it is not helpful to others, I am closing the question.

arc_lupus
  • These identifier names suck rocks. What the heck is a, b, d? Always copy/paste code from your text editor. From a test program that has this problem, never make anything up. – Hans Passant May 25 '16 at 09:31
  • Fixed the variables, but will add a short test program later. – arc_lupus May 25 '16 at 10:02
  • Works for Me (tm). Have you used a debugger? Exactly what line does it fail on, and what's the value of the address it's reading from (or writing to)? What's the exact failure code? – Mike Vine May 25 '16 at 10:06
  • @MikeVine: I added the debugger output. – arc_lupus May 25 '16 at 10:43

1 Answer

Well, your problem seems to stem from different data sizes.

  • In your first snippet you run the float loop up to SIZE/8=8192, so i < 8192. Here I'm unsure why you would step through a FLOAT array with element size 4 by 8.
  • In your second snippet you run the double loop up to SIZE/4=16384, so i < 16384. Here I'm unsure why you would step through a DOUBLE array with element size 8 by 4. The opposite!

The last element of the DOUBLE array may surpass your memory boundaries!

In both cases you increment your loop with i++. So the cases proceed as follows:

First : (FLOAT (4)) j+i*8 (0 < i < 8192 ) =>

0      4      8      12      16     20     24     28  
v1     .      v2     .       v3     .      v4     . 

Second: (DOUBLE(8)) j+i*4 (0 < i < 16384) => v1/v2/v3/v4

0      4      8      12      16     20     24     28     32  
v1(h)  v1(l)  v2(l)  v3(l)   v4(l)  v5(l)  v6(l)  v7(l) 
v1(h)  v2(h)  v3(h)  v4(h)   v5(h)  v6(h)  v7(h)  v8(h)  v8(h)
--------------------------------------------------------------
some thing ... some thing ... some thing .. some thing ...

In the second snippet you mix up the high parts (32-bit) and the low parts (32-bit) of the 64-bit DOUBLE by incrementing by only 4 (sizeof FLOAT) instead of 8 (sizeof DOUBLE).

Another problem is that _mm256_store_pd requires that...

When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated.

for(int i = 0; i < SIZE/4; i++) doesn't fulfill that requirement.

I am surprised that your FLOAT version seems to work, because _mm256_store_ps requires that...

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.

but you only have an alignment of 8 bytes...

However, you need to fix the 'scale' of your i variable to make this work.

zx485
  • Strange, but if I compare the results afterwards (by testing each element with `c[i] != a[i]*b[i]` in a loop), I get correct results... – arc_lupus May 25 '16 at 10:37
  • @arc_lupus: If you get correct results, what may have been your real question/problem? I (simply) analyzed your code for potential weaknesses/errors. And I don't know why your test code returns the correct values, I just emphasized potential sources of problems. – zx485 May 25 '16 at 10:45
  • I was wondering why you say that my code is wrong when I still get a correct result for the float values afterwards. – arc_lupus May 25 '16 at 10:48
  • @arc_lupus: Your question is weird: you said your code isn't working because of a - and I quote - `memory access error (such as if the memory is not aligned correctly)`. I tried to approximate that problem, so what is yours? To put it differently: if you think your code was all OK, why did you ask a question? – zx485 May 25 '16 at 10:52
  • Still the same, but your suggestion that my code is inherently wrong because of the wrong step sizes does not seem to hold, since I get correct results for the float values with my approach. You suggest that I should adjust the step size to +=4 for float, is that correct? – arc_lupus May 25 '16 at 10:54
  • Well, I merely stated that a `FLOAT` consists of 4 bytes and a `DOUBLE` consists of 8 bytes and that the steps of the loops should be tailored accordingly. Calculate that! So a 4-byte-stepping in a `DOUBLE` loop may cause problems... – zx485 May 25 '16 at 11:02
  • Please note that `float *x; float* y = x + k;` steps by `k * sizeof(float)` bytes, and similarly for `double`. I think your analysis of arc's code is wrong. – BeyelerStudios May 25 '16 at 11:14
  • @zx485: I think you may be confused here - the factors of 4 and 8 are the number of elements per vector, not the size of the elements in bytes - I think the loop increments are actually correct in the OP's code. – Paul R May 25 '16 at 11:17
  • @Paul R: Yes. I may have been confused. I'm not bulletproof concerning C[++] and I did not compile that to assembly to make sure. – zx485 May 25 '16 at 11:32
  • @zx485: yes, it's a little confusing, particularly as the OP has used hard-coded literals instead of meaningful constants, but for SIMD loops the pointer step will be the number of elements per vector, e.g. if you have `SIZE` `float`s in an array and you are using AVX vectors with 8 `float`s per vector, the loop will run `SIZE / 8` times. – Paul R May 25 '16 at 11:36
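
The point made in the last few comments can be checked with plain C. This is a rough sketch under stated assumptions: the constant and function names (`AVX_BYTES`, `FLOATS_PER_VEC`, `DOUBLES_PER_VEC`, `byte_step_float`, `byte_step_double`) are invented for illustration, and the sizes assume the usual x86-64 ABI where `float` is 4 bytes and `double` is 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants (not from the original code): one AVX
 * register holds 32 bytes, so the lane count depends on the type. */
#define AVX_BYTES       32
#define FLOATS_PER_VEC  (AVX_BYTES / sizeof(float))   /* 8 with 4-byte float  */
#define DOUBLES_PER_VEC (AVX_BYTES / sizeof(double))  /* 4 with 8-byte double */

/* Distance in bytes between base and base + elems. Pointer arithmetic
 * counts elements, so the compiler scales by sizeof(float) itself.
 * base + elems must stay within (or one past) the same array. */
static size_t byte_step_float(const float *base, size_t elems)
{
    assert(base != NULL);
    return (size_t)((const char *)(base + elems) - (const char *)base);
}

/* Same as above for double. */
static size_t byte_step_double(const double *base, size_t elems)
{
    assert(base != NULL);
    return (size_t)((const char *)(base + elems) - (const char *)base);
}
```

Under those assumptions, stepping a `float *` by `FLOATS_PER_VEC` and a `double *` by `DOUBLES_PER_VEC` both move the pointer by 32 bytes, so each iteration of either loop in the question stays on a 32-byte boundary as long as the base pointer is aligned.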