Setting proper alignment of packed long long on GCC for use with AVX2 instructions

Question

Introduction:

I'm writing a function in x86_64 assembly that processes 4 packed long long int values using AVX2 instructions. Here is what my header file looks like:

avx2.h

#define AVX2_ALIGNMENT 32

// Processes 4 packed long long int and 
// returns a pointer to a result 
long long * process(long long *);

The assembly implementation of the process function looks as follows:

avx2.S:

global process

process:
    vmovaps ymm0, [rdi]
    ;other instructions omitted

The vmovaps ymm0, [rdi] instruction requires the address in rdi to be 32-byte aligned. In assembly, alignment is controlled by the align 32 directive.

The problem:

GCC provides the __BIGGEST_ALIGNMENT__ macro, which on my implementation is 16. The C18 Standard at 6.2.8/3 says:

An extended alignment is represented by an alignment greater than _Alignof (max_align_t). It is implementation-defined whether any extended alignments are supported and the storage durations for which they are supported.

So the implementation-defined extended alignment on GCC is also 16, and I'm not sure whether the following code causes UB:

#include "avx2.h"

//AVX2_ALIGNMENT = 32, __BIGGEST_ALIGNMENT__ = 16
_Alignas(AVX2_ALIGNMENT) long long longs[] = {1, 32, 432, 433};
long long *result = process(longs);

Is there a way to rewrite the code without UB? (I'm aware of the intrinsics in immintrin.h; they are not the topic of this question.)

You could of course just use `vmovups` instead of `vmovaps` and forget about alignment. — Paul R, Aug 10 '19 at 10:56
@PaulR I just checked the latency/TP/uops for [`vmovaps`](https://uops.info/html-instr/VMOVAPS_M256_YMM.html)/[`vmovups`](https://uops.info/html-lat/SKL/VMOVUPS_M256_YMM-Measurements.html) and noticed that they are almost the same on `Skylake` (both have 10c latency and 2 uops), so `vmovups` should probably be preferable... No? — St.Antario, Aug 10 '19 at 11:08
If your data happens to be 32-byte aligned, there should be no noticeable difference in performance. If it's not aligned, you may see a relatively small difference in performance due to fairly subtle cache/memory issues, which may or may not be significant in your use case, depending on just how performance-critical the code is, your memory access pattern, and how much computation you're doing after the initial load. — Paul R, Aug 10 '19 at 11:16

score 3 · Accepted Answer · answered Aug 10 '19 at 11:01

Your code is already free of UB. Any decent compiler would error on an _Alignas() that it didn't support.

Note that the standard says presence/absence of this support is implementation-defined. It doesn't mention UB anywhere. An implementation should know what it supports and check at compile time whether it can support a given _Alignas or not.
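As a minimal sketch of that compile-time checking, assuming a C11 compiler such as GCC targeting x86-64: the wrapper type packed4_ll and the helper packed4_is_aligned below are hypothetical names, not part of the question's code. The idea is to attach the alignment to a type and let static_assert confirm what the compiler actually applied.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>   /* uintptr_t */

#define AVX2_ALIGNMENT 32

/* Hypothetical wrapper type: the alignment requirement travels with the type. */
typedef struct {
    _Alignas(AVX2_ALIGNMENT) long long v[4];
} packed4_ll;

/* A compiler that does not support this extended alignment is expected to
   reject the _Alignas above. These assertions only double-check the result. */
static_assert(_Alignof(packed4_ll) == AVX2_ALIGNMENT,
              "packed4_ll must be 32-byte aligned");
static_assert(sizeof(packed4_ll) == 4 * sizeof(long long),
              "packed4_ll must not contain padding");

/* Runtime sanity check: returns 1 if p sits on an AVX2_ALIGNMENT boundary.
   The remainder test assumes a flat address space, as on x86-64. */
static int packed4_is_aligned(const packed4_ll *p)
{
    return ((uintptr_t)p % AVX2_ALIGNMENT) == 0;
}
```

With a type like this, the `v` member of any packed4_ll object, static or automatic, could be passed to process() without repeating the alignment specifier at each declaration.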

A low-quality implementation could, I guess, decide that too-high values for _Alignas() are UB. I haven't actually checked.

The implementations that can compile this code (gcc/clang/MSVC/ICC) all support at least _Alignas(256) for automatic and static storage, AFAIK, and probably nearly arbitrarily large alignments, especially for static storage. (I left out SunCC, which may still be around and may have AVX2 support. I assume it's fine too, but I haven't looked at its asm output.)

All of those compilers definitely do know how to over-align the stack to 32 or 64, so there's no reason they can't do it for arbitrarily large alignments, apart from stack-size limits.

It should be safe to assume that every compiler that supports Intel intrinsics also supports extended alignments for _Alignas() at least up to the size of a couple cache lines.

(FYI, you can #include <stdalign.h> so you can use alignas() the same way you would in C++.)
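For illustration, here is a rough sketch of the question's array declared with alignas from <stdalign.h>, in both static and automatic storage. The helper names is_aligned, static_longs_aligned and automatic_longs_aligned are made up for this example, and the uintptr_t remainder check assumes a flat address space such as x86-64.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas */
#include <stddef.h>    /* size_t */
#include <stdint.h>    /* uintptr_t */

#define AVX2_ALIGNMENT 32

/* Static storage: the array is placed on a 32-byte boundary. */
static alignas(AVX2_ALIGNMENT) long long static_longs[] = {1, 32, 432, 433};

/* process() from the question reads exactly 4 packed values. */
static_assert(sizeof static_longs == 4 * sizeof(long long),
              "expected exactly 4 packed long long values");

/* Returns 1 if p is a multiple of alignment (assumes a flat address space). */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Checks the static array. */
int static_longs_aligned(void)
{
    return is_aligned(static_longs, AVX2_ALIGNMENT);
}

/* Checks an array in automatic storage: the compiler has to over-align
   the stack frame (or otherwise place the object) to honor alignas. */
int automatic_longs_aligned(void)
{
    alignas(AVX2_ALIGNMENT) long long longs[] = {1, 32, 432, 433};
    return is_aligned(longs, AVX2_ALIGNMENT);
}
```

On the compilers mentioned above, both helpers should return 1, and either array could be handed to process() from the question.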

Caveat: MinGW stack alignment for `__m256` variables

Last I heard, MinGW is still broken. It knows how to align the stack for _Alignas(32), but fails to do so for __m256 / __m256i / __m256d variables, potentially spilling/reloading them with misaligned vmovaps.

Or something like that. If you care about MinGW, you'd better look into it. Or just use clang when targeting Windows.

__An implementation should know what it supports and check at compile time whether it can support a given _Alignas or not.__ I tested it with my `gcc-7.4.0` and the maximum allowed alignment which worked fine was `8192 * 8192 * 4`. Trying to set `_Alignas(8192 * 8192 * 8)` causes `error: requested alignment is too large`. — St.Antario, Aug 10 '19 at 11:25
@St.Antario: oh good, thanks for testing that at least gcc works the way I expected. — Peter Cordes, Aug 10 '19 at 11:27
