Using STL vector with SIMD intrinsic data type

Question

As the title says, I am trying to use an STL vector with a SIMD intrinsic data type. I know it is not good practice because of the potential load/store overhead, but I ran into a strange fault. Here is the code:

#include "immintrin.h"
#include <vector>
#include <stdio.h>

#define VL 8

int main () {
    std::vector<__m256> vec_1(10);
    std::vector<__m256> vec_2(10);

    float * tmp_1 = new float[VL];
    printf("vec_1[0]:\n");
    _mm256_storeu_ps(tmp_1, vec_1[0]); // seems to go as expected
    for (int i = 0; i < VL; ++i)
        printf("%f ", tmp_1[i]);
    printf("\n");
    delete[] tmp_1; // allocated with new[], so release with delete[]

    float * tmp_2 = new float[VL];
    printf("vec_2[0]:\n");
    _mm256_storeu_ps(tmp_2, vec_2[0]); // segmentation fault
    for (int i = 0; i < VL; ++i)
        printf("%f ", tmp_2[i]);
    printf("\n");
    delete[] tmp_2; // allocated with new[], so release with delete[]

    return 0;
}

I compiled it using g++ -O3 -g -std=c++11 -mavx2 test.cpp -o test. vec_1[0] is printed as expected (all zeros), but a segmentation fault happens when it comes to vec_2[0]. I suspected an alignment issue, but I am using _mm256_storeu_ps rather than _mm256_store_ps, and the unaligned variant does not require alignment.

The machine is an Intel Haswell CPU with the AVX2 extension. The GCC version is 4.8.5.

Any possible clue is welcome.

When I compile and run your code, I get a segmentation fault *in the vector constructor*, because it's doing `fill_n` on misaligned `__m256`, I suppose. If I wrap your `__m256` in `std::aligned_storage` (and `reinterpret_cast` where appropriate), it runs fine and prints all zeroes. So presumably, yes, it is an alignment issue. I'm not sure how these things work, but `_mm256_storeu_ps` takes its 2nd argument *by value*, which is hence copied from the underlying vector by `operator[]`, whose elements aren't aligned! — user703016, Sep 21 '16 at 05:37
@PatrickM'Bongo Thanks for your advice. I tried `typedef std::aligned_storage::type __m256_pod;` then `std::vector<__m256_pod> vec_2(10);` and `_mm256_storeu_ps(tmp_2, reinterpret_cast<__m256&>(vec_2[0]));` but segmentation fault still happens at the same line. Did I do something wrong? I am sorry I am new to this. — MarZzz, Sep 21 '16 at 07:01
I wasn't aware of this but it seems the allocator is [allowed to ignore the requested alignment](http://stackoverflow.com/questions/16425359/should-stdvector-honour-alignofvalue-type). Indeed, on my machine the first vector's elements are aligned to 32 bytes, but the second vector's elements are only aligned to 16 bytes. Not sure why. Perhaps the first vector gets a "fresh" memory chunk which *happens* to be nicely aligned? Either way, you will have to write your own `aligned_allocator` and pass it as a template argument to your vector. — user703016, Sep 21 '16 at 07:35
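For illustration, here is a minimal sketch of what such an allocator could look like in C++11. It is not code from this thread: the names `aligned_allocator`, `Vec8f`, `is_aligned` and `demo` are made up for the example, and `Vec8f` is only a 32-byte-aligned stand-in for `__m256` so the sketch builds without `-mavx`. It over-allocates with `std::malloc` and rounds the address up, which is one of several possible strategies (`posix_memalign` or `_mm_malloc` are others).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-in for __m256: 8 floats with 32-byte alignment.
struct alignas(32) Vec8f { float lane[8]; };

// Hypothetical allocator: hands out blocks aligned to `Alignment` bytes.
template <typename T, std::size_t Alignment>
struct aligned_allocator {
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    static_assert(Alignment >= alignof(void*), "Alignment too small to stash a pointer");

    using value_type = T;

    // Needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = aligned_allocator<U, Alignment>; };

    aligned_allocator() noexcept {}
    template <typename U>
    aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        const std::size_t extra = Alignment + sizeof(void*);
        if (n > (static_cast<std::size_t>(-1) - extra) / sizeof(T))
            throw std::bad_alloc();
        // Payload + worst-case padding + room for the pointer malloc returned.
        void* raw = std::malloc(n * sizeof(T) + extra);
        if (!raw) throw std::bad_alloc();
        std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        std::uintptr_t aligned =
            (start + Alignment - 1) & ~static_cast<std::uintptr_t>(Alignment - 1);
        // Remember the original pointer just below the aligned block.
        reinterpret_cast<void**>(aligned)[-1] = raw;
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept {
        if (p) std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept { return false; }

inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Mirrors the question: two vectors of ten elements each.
inline void demo() {
    std::vector<Vec8f, aligned_allocator<Vec8f, 32>> vec_1(10);
    std::vector<Vec8f, aligned_allocator<Vec8f, 32>> vec_2(10);
    assert(is_aligned(vec_1.data(), 32));
    assert(is_aligned(vec_2.data(), 32));
}
```

With real intrinsics the declaration would presumably become `std::vector<__m256, aligned_allocator<__m256, 32>> vec_2(10);`, compiled with `-mavx` or `-mavx2`.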
Referencing `vec_2[0]` requires the memory used by `vec_2` to be aligned, since gcc will emit aligned loads/stores when dereferencing `__m256*`. If you check with a debugger, you should see the segfault on a load insn. If you use a vector of `float`, you could use unaligned load intrinsics on it. (But it would be more sensible to avoid the overhead of dynamically allocating memory at all for small fixed sizes. Just use a local array. If it's an array of float, use `alignas(32) float foo[VL*8]`, otherwise `__m256 foo[VL]` should correctly inherit the alignment requirement of `__m256`). — Peter Cordes, Sep 21 '16 at 09:15
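A rough sketch of the `std::vector<float>` approach mentioned above, in case it helps: the data lives in a plain float vector and is only turned into `__m256` values through the unaligned load/store intrinsics, so nothing depends on how the vector's buffer is aligned. The function name `add_one` is hypothetical, and the `__AVX__` guard with a scalar fallback is only there so the sketch also builds without `-mavx`.

```cpp
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Adds 1.0f to every element, eight lanes at a time when AVX is enabled.
void add_one(std::vector<float>& data) {
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 ones = _mm256_set1_ps(1.0f);
    for (; i + 8 <= data.size(); i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);  // unaligned load: no alignment assumed
        v = _mm256_add_ps(v, ones);
        _mm256_storeu_ps(&data[i], v);         // unaligned store
    }
#endif
    // Scalar loop handles the tail (or everything, without AVX).
    for (; i < data.size(); ++i)
        data[i] += 1.0f;
}
```

Because the size is input-dependent in the original program, the vector can simply be sized as `n * 8` floats at run time.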
Maybe also try not using such an old compiler, especially if you care about performance. AVX2 was still pretty new when gcc4.8 was released. The current versions are 5.4 or 6.2. — Peter Cordes, Sep 21 '16 at 09:16
@PeterCordes Thanks for your advice. This example is just a toy, but the size is input-dependent in the original program. I tried declaring it as `__m256 vec[2];` instead of `__m256 * vec = new __m256[2]` or `std::vector<__m256> vec(2)`, and found that it worked. That is, a fixed-size array declaration respects the alignment requirement of `__m256`, but dynamic allocation and `std::vector` do not. Is this true? — MarZzz, Sep 22 '16 at 01:03
I don't know; I've never tried to use a std::vector of `__m128` or `__m256`. If you want fast temporary storage, you usually don't need dynamic allocation. It sounds like Patrick is saying that it's actually not guaranteed to be aligned. (i.e. not guaranteed to work correctly). — Peter Cordes, Sep 22 '16 at 01:19
BTW, inside a function, `__m256 vec[2];` is "automatic" storage, i.e. on the stack, private to this function call. There is near-zero overhead for allocating automatic storage; space for all the locals for a function is reserved at once, with a single asm instruction (like `sub rsp, 512`). "static" storage is when it has a compile-time-constant address, like a global or `static __m256 vec[2];` (inside or outside a function). — Peter Cordes, Sep 22 '16 at 01:21
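A small sketch of the difference discussed in the last few comments, under the assumption of a C++11 compiler such as the one in the question. The compiler places an automatic array according to the type's alignment, whereas a plain `new` expression before C++17 has no way to pass the alignment to the allocation function. `Vec8f` and `automatic_array_is_aligned` are hypothetical names, and `Vec8f` stands in for `__m256` so that no `-mavx` flag is needed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for __m256: 8 floats with 32-byte alignment.
struct alignas(32) Vec8f { float lane[8]; };

// The compiler knows the alignment of an automatic array at compile time
// and reserves suitably aligned stack space for it.
bool automatic_array_is_aligned() {
    Vec8f vec[2] = {};
    return reinterpret_cast<std::uintptr_t>(&vec[0]) % alignof(Vec8f) == 0;
}
```

Since C++17, `new` and `std::allocator` do take over-aligned types into account, so a newer compiler in C++17 mode is another possible way out.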
You could also try `boost::alignment::aligned_allocator` as allocator for the `std::vector` . — noma, Apr 24 '18 at 10:33
