
I am using Agner Fog's vectorclass library to use SIMD instructions (AVX specifically) in my application. Since it is best to use struct-of-arrays data structures to employ SIMD easily, I quite often use:

std::vector<Vec8d> some_var;

or even

struct some_struct {
    std::vector<Vec8d> a;
    std::vector<Vec8d> b;
};

I wonder if this is bad (performance-wise, or even just completely wrong), considering that std::vector's internal Vec8d* array may in fact not be aligned?

wvc
  • completely depends on what you do with that data-structure – 463035818_is_not_an_ai Feb 05 '21 at 11:10
  • What does "may in fact not be aligned" mean, and why do you think that is the case? – underscore_d Feb 05 '21 at 11:11
  • data structures don't "perform", operations you apply to data structures do – 463035818_is_not_an_ai Feb 05 '21 at 11:17
  • if splitting it up means that you no longer have to delete items in the middle of the vector, then yes, it can certainly offer decent performance improvements. – UKMonkey Feb 05 '21 at 11:21
  • @underscore_d: it means that prior to C++17, `std::vector` didn't respect `alignof(T)` and was thus broken for types like `__m512d` (or Vec8d, which wraps it). C++17 finally fixed the standard allocators to be usable for over-aligned types (alignment greater than alignof(max_align_t)). Some compilers (e.g. MSVC and ICC) only ever use unaligned load/store instructions, even when a type like `__m512d` promises that the memory is aligned, but GCC and Clang will use `vmovapd` to store a `__m512d`. – Peter Cordes Feb 05 '21 at 12:09

3 Answers


I would generally use vector<double>, and standard SIMD load/store intrinsics to access the data. That avoids tying the interface and all code that touches it to that specific SIMD vector width and wrapper library. You can still pad the size to a multiple of 8 doubles so you don't have to include cleanup handling in your loops.
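As a minimal sketch of that pattern (the function and variable names here are invented, and it assumes the vector's size has already been padded up to a multiple of 8):

#include <cstddef>
#include <vector>
#include "vectorclass.h"   // Agner Fog's VCL; provides Vec8d

// Scale every element of a vector<double> whose size is a multiple of 8,
// so no scalar cleanup loop is needed.
void scale(std::vector<double> &v, double factor) {
    for (std::size_t i = 0; i < v.size(); i += 8) {
        Vec8d a;
        a.load(&v[i]);    // VCL load/store don't require alignment; aligned data is just cheaper
        a *= factor;
        a.store(&v[i]);
    }
}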

However, you might want to use a custom allocator for that vector<double> so you can get it to align your doubles. Unfortunately, even if that allocator's underlying memory allocation is compatible with new/delete, it will have a different C++ type than vector<double> so you can't freely assign / move it to such a container if you use that elsewhere.

I'd worry that if you do ever want to access individual double elements of your vector, doing Vec8vec[i][j] might lead to much worse asm (e.g. a SIMD load and then a shuffle or store/reload from VCL's operator[]) than vecdouble[i*8 + j] (presumably just a vmovsd), especially if it means you need to write a nested loop where you wouldn't otherwise need one.

avec.load(&doublevec[8]); should generate (almost or exactly) identical asm to avec = Vec8vec[1];. If the data is in memory, the compiler will need to use a load instruction to load it. It doesn't matter what "type" it had; types are a C++ thing, not an asm thing; a SIMD vector is just a reinterpretation of some bytes in memory.
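For example, these two accessors (the container names are hypothetical, and both are assumed to hold the same 16+ doubles) should compile to essentially the same single load instruction:

#include <vector>
#include "vectorclass.h"

// Both return the doubles at indices 8..15: once as Vec8d element [1],
// once as an explicit load from a flat vector<double>.
Vec8d from_vec8d_vector(const std::vector<Vec8d> &Vec8vec)    { return Vec8vec[1]; }
Vec8d from_double_vector(const std::vector<double> &doublevec) {
    Vec8d avec;
    avec.load(&doublevec[8]);
    return avec;
}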


But if std::vector<Vec8d> is the easiest way you can convince a C++17 compiler to align a dynamic array to 64 bytes, then it may be worth considering. It's still nasty and will cause future pain if/when porting to ARM NEON or SVE, because Agner's VCL only wraps x86 SIMD, last I checked. Even porting to AVX2 will suck.

A better way might be a custom allocator (I think Boost has some already-written) that you can use as the 2nd template param to something like std::vector<double, aligned_allocator<64>>. This is also type-incompatible with std::vector<double> if you want to pass it around and assign it to other vector<>s, but at least it's not tied to AVX512 specifically.
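A minimal sketch of what such an allocator can look like with C++17 aligned operator new/delete (this is illustrative code, not Boost's actual interface; note it needs the element type as a template parameter too):

#include <cstddef>
#include <new>
#include <vector>

template <class T, std::size_t Align>
struct aligned_allocator {
    using value_type = T;

    aligned_allocator() noexcept = default;
    template <class U> aligned_allocator(const aligned_allocator<U, Align> &) noexcept {}
    template <class U> struct rebind { using other = aligned_allocator<U, Align>; };

    T *allocate(std::size_t n) {
        return static_cast<T *>(::operator new(n * sizeof(T), std::align_val_t(Align)));
    }
    void deallocate(T *p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Align));
    }
};

template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A> &, const aligned_allocator<U, A> &) { return false; }

using aligned_dvec = std::vector<double, aligned_allocator<double, 64>>;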

If you aren't using a C++17 compiler (so std::vector doesn't respect alignof(T) > alignof(max_align_t) i.e. 16), then don't even consider this; it will fault when compilers like GCC and Clang use vmovapd (alignment-required) to store a __m512d.

You'll want to get your data aligned; 64-byte alignment makes a bigger difference with AVX512 than with AVX2 on current AVX512 CPUs (Skylake-X).

MSVC (and I think ICC) for some reason choose to always use unaligned load/store instructions (except when folding loads into memory source operands, which with legacy-SSE instructions requires 16-byte alignment), even when compile-time alignment guarantees exist. I assume that's why it happens to work for you.

For an SoA data layout, you might want to share a common size for all arrays, and use aligned_alloc (compatible with free, not delete) or something similar to manage the storage for double * members yourself. Unfortunately there's no standard aligned allocator that supports an aligned_realloc, so you always have to copy, even if there was free virtual address space following your array that a non-crappy API could have let your array grow into without copying. Thanks, C++.
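A rough sketch of that layout (the struct and member names are invented; note that MSVC doesn't provide std::aligned_alloc, so you'd use _aligned_malloc/_aligned_free there instead):

#include <cstddef>
#include <cstdlib>   // std::aligned_alloc / std::free

struct soa_data {
    double *a = nullptr;
    double *b = nullptr;
    std::size_t n = 0;                            // one size shared by all arrays

    explicit soa_data(std::size_t count) {
        n = (count + 7) & ~std::size_t(7);        // pad to a multiple of 8 doubles
        std::size_t bytes = n * sizeof(double);   // therefore a multiple of 64, as aligned_alloc requires
        a = static_cast<double *>(std::aligned_alloc(64, bytes));
        b = static_cast<double *>(std::aligned_alloc(64, bytes));
    }
    ~soa_data() {
        std::free(a);                             // aligned_alloc memory is freed with free, not delete
        std::free(b);
    }
    soa_data(const soa_data &) = delete;          // no accidental copies / double-free
    soa_data &operator=(const soa_data &) = delete;
};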

Peter Cordes
  • Thanks for your thorough, well-argued answer, I appreciate it. (Although I don't appreciate the '... like a normal person ...') – wvc Feb 05 '21 at 13:45
  • @wvc: it was intended with a wink. I guess I'm going to have to stop using it if people are finding it rude in text, not amusing. :/ – Peter Cordes Feb 05 '21 at 19:17
  • It's ok. Your answer is full of useful information and I'm actively applying your suggestions - I learned a lot here. – wvc Feb 08 '21 at 08:43
  • std::vector is not guaranteed to be properly aligned before C++17. std::vector will be aligned under C++17, which is required for the vector class library anyway. – A Fog Feb 10 '21 at 11:05

std::vector will be properly aligned under C++17, which is required for the vector class library anyway. This will work OK. The std::vector template is relatively efficient. Several other standard container templates are very inefficient because they are implemented as linked lists with an awful lot of dynamic memory allocations and de-allocations.

If the size of the array is known at compile time, or if you have a sensible upper limit on the array size, then it may be more efficient to just make an old-fashioned C array.

const int arraysize = 0x100;
alignas(64) double myarray[arraysize];  // AVX-512 benefits a lot from alignment
...
Vec8d a;
for (int i = 0; i < arraysize; i += a.size()) {  // step by 8 doubles per Vec8d
    a.load(myarray + i);
    // do your calculations here
}

If the array size is not known at compile time, then you may simply allocate your own array:

Vec8d * mydynamicarray = new Vec8d[mysize];

It is good practice to wrap the memory allocation in a container class with a destructor that cleans up the allocation:

~myContainerClass() {
    if (mydynamicarray != 0) delete[] mydynamicarray;
}
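As a sketch, a complete wrapper along those lines might look like this (the class and member names follow the snippets above; the rest is illustrative):

#include <cstddef>
#include "vectorclass.h"

class myContainerClass {
    Vec8d *mydynamicarray = nullptr;
    std::size_t mysize = 0;
public:
    explicit myContainerClass(std::size_t n) : mydynamicarray(new Vec8d[n]), mysize(n) {}
    ~myContainerClass() { delete[] mydynamicarray; }       // delete[] on a null pointer is a no-op
    myContainerClass(const myContainerClass &) = delete;   // prevent double-delete
    myContainerClass &operator=(const myContainerClass &) = delete;
    Vec8d &operator[](std::size_t i) { return mydynamicarray[i]; }
    std::size_t size() const { return mysize; }
};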
A Fog

It depends on how you intend to use some_struct, but rather than using two vectors, one for each member, you may prefer:

struct alignas(64) some_struct {
    double a[8];
    double b[8];
};

std::vector<some_struct> vector_of_struct_of_arrays{};

I find that my code is usually cleaner with this layout, and, as was mentioned, it allows for the use of a different library in the future if you couldn't use vectorclass for some reason.
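As an illustration of how that layout pairs with VCL (the function and the operation it performs are made up), iterating over the vector and loading each member array into a Vec8d could look like:

#include <vector>
#include "vectorclass.h"

struct alignas(64) some_struct {   // same layout as above
    double a[8];
    double b[8];
};

// Add each struct's b[] into its a[], one Vec8d at a time.
void add_b_into_a(std::vector<some_struct> &v) {
    for (some_struct &s : v) {
        Vec8d a, b;
        a.load(s.a);
        b.load(s.b);
        (a + b).store(s.a);
    }
}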

  • Remember that C++17 is required for `std::vector` to respect the `alignas(64)`. In C++14 and earlier, std::vector ignored over-alignment requirements. – Peter Cordes Apr 05 '21 at 18:58