1
    const __m128i ___n = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x80808080 );
    const __m128i w___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x0f0e0d0c );
    const __m128i z___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x0b0a0908 );
    const __m128i zw__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0f0e0d0c, 0x0b0a0908 );
    const __m128i y___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x07060504 );
    const __m128i yw__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0f0e0d0c, 0x07060504 );
    const __m128i yz__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0b0a0908, 0x07060504 );
    const __m128i yzw_ = _mm_set_epi32( 0x80808080, 0x0f0e0d0c, 0x0b0a0908, 0x07060504 );
    const __m128i x___ = _mm_set_epi32( 0x80808080, 0x80808080, 0x80808080, 0x03020100 );
    const __m128i xw__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0f0e0d0c, 0x03020100 );
    const __m128i xz__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x0b0a0908, 0x03020100 );
    const __m128i xzw_ = _mm_set_epi32( 0x80808080, 0x0f0e0d0c, 0x0b0a0908, 0x03020100 );
    const __m128i xy__ = _mm_set_epi32( 0x80808080, 0x80808080, 0x07060504, 0x03020100 );
    const __m128i xyw_ = _mm_set_epi32( 0x80808080, 0x0f0e0d0c, 0x07060504, 0x03020100 );
    const __m128i xyz_ = _mm_set_epi32( 0x80808080, 0x0b0a0908, 0x07060504, 0x03020100 );
    const __m128i xyzw = _mm_set_epi32( 0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100 );
    const __m128i LUT[16] = { ___n, x___, y___, xy__, z___, xz__, yz__, xyz_, w___, xw__, yw__, xyw_, zw__, xzw_, yzw_, xyzw };

I use a lookup table like the one above for the SSE/SSSE3 version of a compare and left pack routine. A set of compares happens multiple (60) times per second and I'd like it to be in memory instead of set each time and have scope limited to one .c file but trying to define it out of a function with or without static yields the error: initializer element is not constant. for each 'set'. Why does this happen and how can I do this properly?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Timothy s
  • 9
  • 2

3 Answers3

4

The compiler reports that an initializer element is not constant because _mm_set_epi32 is a function call and does not satisfy the “constant” requirements for initializers. Also, the various variables ___n and such that you define do not qualify as constants for initializing LUT.

You can define your array with:

const __v4su LUT[16] =
{
    { 0x80808080, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x80808080, 0x80808080, 0x80808080 },
    { 0x07060504, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x07060504, 0x80808080, 0x80808080 },
    { 0x0b0a0908, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x0b0a0908, 0x80808080, 0x80808080 },
    { 0x07060504, 0x0b0a0908, 0x80808080, 0x80808080 },
    { 0x03020100, 0x07060504, 0x0b0a0908, 0x80808080 },
    { 0x0f0e0d0c, 0x80808080, 0x80808080, 0x80808080 },
    { 0x03020100, 0x0f0e0d0c, 0x80808080, 0x80808080 },
    { 0x07060504, 0x0f0e0d0c, 0x80808080, 0x80808080 },
    { 0x03020100, 0x07060504, 0x0f0e0d0c, 0x80808080 },
    { 0x0b0a0908, 0x0f0e0d0c, 0x80808080, 0x80808080 },
    { 0x03020100, 0x0b0a0908, 0x0f0e0d0c, 0x80808080 },
    { 0x07060504, 0x0b0a0908, 0x0f0e0d0c, 0x80808080 },
    { 0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c },
};

Since the type is changed from __m128i to __v4su, you may need casts to work with it. (The vector operations are designed for this.)

However, defining it outside of any function and/or with static storage duration does not ensure it will be in memory.

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312
  • You probably need to reverse the order of the elements, since `_mm_set_epi32` takes the arguments in big-endian order. – chtz Aug 29 '21 at 15:00
  • Just of curiosity, what is the definition (semantically) of `_mm_set_epi32`? – alinsoar Aug 29 '21 at 15:02
  • 2
    @alinsoar: It takes four `int` values, makes a `__v4si` compound literal with them in reverse order, and casts that to an `__m128i`, which is a 16-byte vector of two `long long` (64 bits) elements: `static __inline__ __m128i __DEFAULT_FN_ATTRS _mm_set_epi32(int __i3, int __i2, int __i1, int __i0) { return __extension__ (__m128i)(__v4si){ __i0, __i1, __i2, __i3}; }` – Eric Postpischil Aug 29 '21 at 15:27
  • thanks Eric! Out of curiosity, where is a static variable defined outside of any function located? It's useable by every function in the .c so why wouldn't it be in memory? – Timothy s Aug 29 '21 at 15:30
  • @Timothys: Static data is generally in a program’s data section. It is in the virtual memory of a process. But that memory is virtual: It is managed by the operating system. It only needs to be in physical memory when the program is actually using it. Outside of that, the system may remove it from physical memory (and it can restore it later if the program needs it again). This is generally true of all of the memory of your process: static data section, stack, allocated memory, even the code. – Eric Postpischil Aug 29 '21 at 15:34
  • @Timothys: Using data frequently is usually sufficient to ensure it is in physical memory unless your system is under heavy load. It is usually not necessary to ensure data is always in memory unless it is part of some critical real-time tasks. If it is truly necessary, some operating systems support requests to keep data in physical memory. This may require elevated privileges. However, more commonly, a performance issue may be affected by whether or not data is in cache. – Eric Postpischil Aug 29 '21 at 15:36
  • @Timothys: If you have not observed performance issues and are merely speculating that you need to keep the data in memory as you start to build the application, then it is likely too soon to worry about it. – Eric Postpischil Aug 29 '21 at 15:37
3

_mm_set_epi32 with compile-time-constant args gets optimized to a vector constant, typically loaded from memory. You don't need to help the compiler with this, and in fact it's worse if you try because compilers are strangely bad at it. If you do need/want to help the compiler with constant layout, use an array of some type like alignas(16) static const int32_t LUT[] = {...}; and use _mm_load_si128( (__m128i*)&LUT[i*4] ) or something like that.

Simple vectors like all-zero or all-1 bits can even get materialized in a register with pxor xmm0,xmm0 or pcmpeqd xmm1, xmm1 even more efficiently than a load, so making sure the compiler can see the constant values when optimizing a function is a good reason to define your constants with _mm_set* inside functions.

Think of _mm_set_epi32 as being like a string literal: the compiler figures out where to put it, and will even do duplicate merging if multiple functions need the same vector constant. (Unlike a string literal, it's a value, not something that decays to a pointer to the storage, so only parts of that analogy work.)


The only good place for _mm_set* is inside a function; current compilers suck at handling global / static variable of __m128i type efficiently, failing to handle it as a static initializer so they actually put an anonymous vector constant in section .rodata, and have a run-time constructor / initializer function copy from that to space in the .bss for the named variable in static storage.

(In C, this can only happen for a static __m128i inside a function, which makes it need a guard variable. In C++, non-constant global initializers are allowed, like int foo = bar(123); at global scope even if bar isn't constexpr. In C, you get the error you're running into.)

For example:

#include <immintrin.h>

__m128i foo() {
    return _mm_setr_epi32(1,2,3,4);
}

Compiles with GCC11.2 -O3 (on Godbolt) to this asm. (clang and ICC, and I think MSVC, are all similar for all of the following code blocks.)

    # in .text
foo():
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        ret

    # in .rodata
.LC0:
        .quad   8589934593     # 0x200000001
        .quad   17179869187    # 0x400000003
__m128i bar(__m128i v) {
    return _mm_add_epi32(v, _mm_setr_epi32(1,2,3,4));
}
bar(long long __vector(2)):
        paddd   xmm0, XMMWORD PTR .LC1[rip]
        ret

        .set    .LC1,.LC0     # make LC1 a synonym for LC0
    # GCC noticed and merged at compile time, not leaving it for the linker.
__m128i retzero() {
    //return _mm_setzero_si128();
    return _mm_setr_epi32(0,0,0,0);  // optimizes the same
}

        pxor    xmm0, xmm0
        ret

But here's what you get from a C++ compiler for a global vector:

__m128i globvec = _mm_setr_epi32(1,2,3,4);
_GLOBAL__sub_I_foo():                          # static initializer code
        movdqa  xmm0, XMMWORD PTR .LC0[rip]        # copy from anonymous .rodata
        movaps  XMMWORD PTR globvec[rip], xmm0     # to named .bss space
        ret
# in .bss
globvec:
        .zero   16

This is what the C error message is "saving" you from; C doesn't allow non-constant static initializers (except in functions).

For a static __m128i inside a function, it would be even worse: you'd get a guard variable to make sure the non-constant initializer ran only in the first call.


Code actually using globvec is basically fine, it will use it as a memory source operand or load it normally, but any optimizations e.g. based on some elements having a known value for constant-propagation through some operations won't be possible. You're also using twice as much space, although the initializer data is only touched once during startup so there isn't an impact on cache footprint.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    Re `__m128i globvec = _mm_setr_epi32(1,2,3,4);` -- clang is able to avoid initializer function https://godbolt.org/z/TMd84r (I wanted to have this fixed for MSVC, but it is [closed - lower priority](https://developercommunity.visualstudio.com/t/C-optimization:-lookup-up-table-does-n/1306656)) – Alex Guteniev Nov 15 '21 at 15:29
0

Or:

constexpr _m128 _val = {0.5f,0.5f,0.5f,0.5f};
dave_thenerd
  • 448
  • 3
  • 10
  • This works for FP vectors, but not portably for `__m128i` vectors like this question was asking about. In GNU C, `__m128i` is `typedef long long __m128i __attribute__((vector_size(16), may_alias))`, so you'd have to manually combine 32-bit initializers like `{0x100000001, 0x100000001}` to emulate `set1_epi32( 1 )`. – Peter Cordes Nov 15 '21 at 03:40
  • Even worse, `__m128i` a union of a bunch of stuff in Visual C++, and that *compiles* with MSVC but not to the right value. Instead equivalent to `_mm_set_epi32(0,0,0,1)`. https://godbolt.org/z/e1h66c4o6 (as discussed in comments on [Is generally faster do \_mm\_set1\_ps once and reuse it or set it on the fly?](https://stackoverflow.com/posts/comments/123685241)) – Peter Cordes Nov 15 '21 at 03:40