I would generally use `vector<double>` and standard SIMD load/store intrinsics to access the data. That avoids tying the interface and all code that touches it to that specific SIMD vector width and wrapper library. You can still pad the size to a multiple of 8 doubles so you don't have to include cleanup handling in your loops.
However, you might want to use a custom allocator for that `vector<double>` so you can get it to align your doubles. Unfortunately, even if that allocator's underlying memory allocation is compatible with new/delete, it will have a different C++ type than plain `vector<double>`, so you can't freely assign / move it to such a container if you use that elsewhere.
I'd worry that if you do ever want to access individual `double` elements of your vector, doing `Vec8vec[i][j]` might lead to much worse asm (e.g. a SIMD load and then a shuffle, or a store/reload from VCL's `operator[]`) than `vecdouble[i*8 + j]` (presumably just a `vmovsd`), especially if it means you need to write a nested loop where you wouldn't otherwise need one.
`avec.load(&doublevec[8]);` should generate (almost or exactly) identical asm to `avec = Vec8vec[1];`. If the data is in memory, the compiler will need to use a load instruction to load it. It doesn't matter what "type" it had; types are a C++ thing, not an asm thing; a SIMD vector is just a reinterpretation of some bytes in memory.
But if this is the easiest way you can convince a C++17 compiler to align a dynamic array by 64, then it's maybe worth considering. Still nasty and will cause future pain if/when porting to ARM NEON or SVE, because Agner's VCL only wraps x86 SIMD last I checked. Or even porting to AVX2 will suck.
A better way might be a custom allocator (I think Boost has some already written) that you can use as the 2nd template param, i.e. something like `std::vector<double, aligned_allocator<64>>`. This is also type-incompatible with `std::vector<double>` if you want to pass it around and assign it to other `vector<>`s, but at least it's not tied to AVX512 specifically.
If you aren't using a C++17 compiler (so `std::vector` doesn't respect `alignof(T) > alignof(max_align_t)`, i.e. 16), then don't even consider this; it will fault when compilers like GCC and Clang use `vmovapd` (alignment-required) to store a `__m512d`.
You'll want to get your data aligned; 64-byte alignment makes a bigger difference with AVX512 than with AVX2 on current AVX512 CPUs (Skylake-X).
MSVC (and I think ICC) for some reason always use unaligned load/store instructions even when compile-time alignment guarantees exist (except when folding loads into memory source operands, which with legacy SSE instructions does require 16-byte alignment). I assume that's why it happens to work for you.
For an SoA data layout, you might want to share a common size for all arrays, and use `aligned_alloc` (compatible with `free`, not `delete`) or something similar to manage the storage for `double *` members. Unfortunately there's no standard aligned allocator that supports an aligned_realloc, so you always have to copy, even if there was free virtual address space following your array that a non-crappy API could have let your array grow into without copying. Thanks, C++.