Vector Scalar multiplication AVX segmentation fault on Mac OSX

Question

Hi, I am trying to write code for vector-scalar multiplication using AVX on a Sandy Bridge processor, i7-3720QM (~2012). The code is C, compiled with GNU gcc on Mac OS X 10.8.

gcc -mavx -Wa,-q -o bb5 code1.c -lm

I am getting Segmentation fault: 11. Please help.

Output:

3.000000 6.000000 9.000000 12.000000 
Segmentation fault: 11

So it looks like the store is not working correctly? Eventually I want to compute something like A = A + x*B, where x is a scalar and A and B are vectors. The function void matsca(const double* a, double c, double *b) will be called repeatedly to operate on a large double vector with a stride of 8, since AVX can take 4 double elements (256 bits). Thanks for your help.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <immintrin.h>

void matsca(const double* a, double c, double *b)
{
    __m256d a0 = _mm256_loadu_pd(a+0);
    __m256d a1 = _mm256_set1_pd(c);

    __m256d a2 = _mm256_mul_pd(a0,a1);

    double* f = (double*)&a2;
    printf("%f %f %f %f \n",f[0],f[1],f[2],f[3]);

    _mm256_store_pd(b,a2);
}

int main()
{
    double m1[11]={1,2,3,4,5,6,7,8,9,10,11};
    double *m3;
    double m2=3;
    int i;

    matsca(&m1[0],m2,&m3[0]);

    for (i=0; i<3; i=i+1)
    {
        printf("%d %f \n",i,m3[i]);
    }

    return 0;
}

And after you have allocated some memory for `m3` you also need to change `_mm256_store_pd` to `_mm256_storeu_pd`. — Paul R, Jan 30 '17 at 09:00
excellent thanks to both of you. it worked. is there any other way to do this more efficiently ? — Guddu, Jan 30 '17 at 09:08
"More efficiently" ? That's more of a design issue, but typically you'd want to process a *lot* more data in one function call, e.g. have a loop in `matsca` so that you can process an arbitrary amount of data, with just one function call and one initialisation of the constant vector etc. — Paul R, Jan 30 '17 at 10:15
@Guddu: I've added an answer now, with a fixed version of `matsca` and a suggested improved version for arbitrary size vectors. — Paul R, Jan 30 '17 at 10:57
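The fix described in the comments above (give `m3` real storage, and use the unaligned store) can be sketched roughly as follows. The names `matsca4` and `scaled_copy4` are hypothetical, and the scalar fallback is an addition for portability so that the file still builds when `-mavx` is not passed; it is not part of the original code.

```c
#include <assert.h>
#include <stdlib.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Writes c * a[0..3] to b[0..3]. Neither pointer needs 32-byte alignment. */
void matsca4(const double *a, double c, double *b)
{
    assert(a != NULL && b != NULL); /* b must point at real storage */
#if defined(__AVX__)
    __m256d va = _mm256_loadu_pd(a);
    __m256d vc = _mm256_set1_pd(c);
    _mm256_storeu_pd(b, _mm256_mul_pd(va, vc)); /* unaligned store */
#else
    /* Scalar fallback (assumption: added here only for non-AVX builds). */
    for (int i = 0; i < 4; ++i)
        b[i] = a[i] * c;
#endif
}

/* Hypothetical helper: returns a heap buffer holding the four products.
   The caller is responsible for calling free() on the result. */
double *scaled_copy4(const double *a, double c)
{
    /* malloc does not promise 32-byte alignment, hence storeu above. */
    double *out = malloc(4 * sizeof *out);
    if (out != NULL)
        matsca4(a, c, out);
    return out;
}
```

In the question's `main`, this amounts to assigning `m3 = malloc(4 * sizeof *m3);` before calling `matsca`, and calling `free(m3);` after the print loop.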

score 0 · Accepted Answer · answered Jan 30 '17 at 10:56

Here is a fixed/improved version of your original matsca:

inline void matsca(const double *a, const double c, double *b)
{
    __m256d a0 = _mm256_loadu_pd(a);
    __m256d a1 = _mm256_set1_pd(c);
    __m256d a2 = _mm256_mul_pd(a0, a1);

#if DEBUG > 0
    double *f = (double *)&a2;
    printf("%f %f %f %f\n", f[0], f[1], f[2], f[3]);
#endif

    _mm256_storeu_pd(b, a2);
}

However, you might want to consider making this more general, so that it can process a vector of any size, e.g.

inline void matsca(const double *a, const double c, double *b, const size_t n)
{
    const __m256d a1 = _mm256_set1_pd(c);
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) // process 4 doubles per iteration
    {
        __m256d a0 = _mm256_loadu_pd(&a[i]);
        __m256d a2 = _mm256_mul_pd(a0, a1);
        _mm256_storeu_pd(&b[i], a2);
    }
    for ( ; i < n; ++i) // handle any odd elements at end of vector
    {
        b[i] = a[i] * c;
    }
}

This way you amortise the cost of the function call, initialising the constant vector, etc.
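The same loop structure could plausibly be extended to the update mentioned in the question, A = A + x*B. The following is only a sketch under stated assumptions: the name `axpy_avx` is hypothetical, it uses a separate multiply and add (plain AVX has no fused multiply-add), and when the compiler is not targeting AVX the whole range is handled by the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical name: computes a[i] += x * b[i] for every i in [0, n). */
void axpy_avx(double *a, double x, const double *b, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (a != NULL && b != NULL));
#if defined(__AVX__)
    const __m256d vx = _mm256_set1_pd(x); /* broadcast x once, outside the loop */

    for (; i + 4 <= n; i += 4) /* 4 doubles per 256-bit register */
    {
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        va = _mm256_add_pd(va, _mm256_mul_pd(vx, vb));
        _mm256_storeu_pd(&a[i], va);
    }
#endif
    for (; i < n; ++i) /* remaining elements (or all of them without AVX) */
    {
        a[i] += x * b[i];
    }
}
```

A call such as `axpy_avx(A, x, B, n)` would then replace the repeated per-block calls to `matsca`, with the broadcast of `x` done once per call.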

can you please explain what is the meaning of adding `inline void` and `size_t` and why did you declare the arrays as ` const` ? sorry i am a newbie. thanks. — Guddu, Feb 05 '17 at 11:20
It's generally a good idea to make input parameters const. Also size_t is the natural type for specifying the (unsigned) size of an array or other data structure. inline is a hint to the compiler to eliminate the overhead of function calls for small performance-critical functions. — Paul R, Feb 05 '17 at 11:40
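As a small illustration of the three points in the comment above, here is a sketch with a hypothetical helper, `scaled_sum`, which is not part of the answer. It uses `static inline`, a common spelling for such helpers in C, since a plain `inline` definition in C99 and later does not by itself provide an external definition and may fail to link if the call is not inlined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the sum of a[i] * c over i in [0, n). */
static inline double scaled_sum(const double *a, const double c, const size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n == 0 || a != NULL);

    /* a[0] = 0.0;  would not compile: a points to const double */
    /* c = 1.0;     would not compile: c is a const parameter   */
    for (i = 0; i < n; ++i) /* size_t index matches the size_t length */
    {
        sum += a[i] * c;
    }
    return sum;
}
```

The `const` qualifiers let the compiler reject accidental writes to the inputs, and `size_t` avoids signed/unsigned mismatches when `n` comes from `sizeof` or from an allocation size.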
