Is it generally faster to call _mm_set1_ps once and reuse it, or to set it on the fly?

Question

Here are two examples. Set the float values in a vector every time within the loop:

static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...

for (int i = 0; i < numValues; i++) {
    paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v, _mm_set1_ps(0.5f)), _mm_set1_ps(2.0f));
    paramKnob.v = _mm_mul_ps(paramKnob.v, _mm_set1_ps(kFMScale));
    
    // code...
}

Or set them once, before the loop, and reuse the same registers on every iteration:

static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...

__m128 v05 = _mm_set1_ps(0.5f);
__m128 v2 = _mm_set1_ps(2.0f);
__m128 vFMScale = _mm_set1_ps(kFMScale);
for (int i = 0; i < numValues; i++) {
    paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v, v05), v2);
    paramKnob.v = _mm_mul_ps(paramKnob.v, vFMScale);

    // code...
}

Which one is generally the better-suited approach? I'd bet on the second, but vectorization fools me most of the time.

And what if I declare them as constants for the whole project instead of "within a block"? Would "load everywhere" be better than "set everywhere"?

I think it's all about cache misses rather than the latency/throughput of the operations.

I'm on a 64-bit Windows machine, using FLAGS += -O3 -march=nocona -funsafe-math-optimizations
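For reference, the two variants can be written as a self-contained sketch that compiles on its own. This is only an illustration under assumptions: the question does not show `paramKnob` or the definition of `k2PIf`, so the sketch operates on a plain float array and uses an assumed value for `k2PIf`; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Assumed value: the question does not show how k2PIf is defined.
static constexpr float k2PIf = 6.28318530718f;
static constexpr float kFMScale = 10.0f / k2PIf;

// Variant 1: the constants are written inside the loop body.
void scale_set_in_loop(float* data, std::size_t n) {
    assert(n % 4 == 0); // the sketch only handles whole vectors
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_mul_ps(_mm_sub_ps(v, _mm_set1_ps(0.5f)), _mm_set1_ps(2.0f));
        v = _mm_mul_ps(v, _mm_set1_ps(kFMScale));
        _mm_storeu_ps(data + i, v);
    }
}

// Variant 2: the constants are created once, before the loop.
void scale_hoisted(float* data, std::size_t n) {
    assert(n % 4 == 0);
    const __m128 v05 = _mm_set1_ps(0.5f);
    const __m128 v2 = _mm_set1_ps(2.0f);
    const __m128 vFMScale = _mm_set1_ps(kFMScale);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_mul_ps(_mm_sub_ps(v, v05), v2);
        v = _mm_mul_ps(v, vFMScale);
        _mm_storeu_ps(data + i, v);
    }
}

// Returns true when both variants produce identical output.
bool variants_agree() {
    float a[8] = {0.0f, 0.25f, 0.5f, 0.75f, 1.0f, 0.1f, 0.9f, 0.33f};
    float b[8];
    for (int i = 0; i < 8; ++i) b[i] = a[i];
    scale_set_in_loop(a, 8);
    scale_hoisted(b, 8);
    for (int i = 0; i < 8; ++i)
        if (a[i] != b[i]) return false;
    return true;
}
```

Both functions perform the same arithmetic in the same order, so any difference between them is purely a matter of what the compiler generates, which is best checked in the assembly output.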

An optimizing compiler will typically hoist those SSE constants out of the loop in the first case. I would consider using: `(2.0f * kFMScale)` instead of two separate multiplications, but this may not have any effect when using: `-funsafe-math-optimizations`. — Brett Hale, Sep 01 '21 at 09:42
The placement of these things does not reliably influence the place in the machine where the constant is actually loaded — harold, Sep 01 '21 at 09:49
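The folding suggested in the first comment could look like the sketch below. It is an illustration only: the float-array interface, the value of `k2PIf`, and the names `kFoldedScale` and `scale_folded` are assumptions, not part of the question. Since doubling a float is exact in binary floating point (barring overflow), folding the factor of 2 into the scale should not change the rounding here.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Assumed value: the question does not show how k2PIf is defined.
static constexpr float k2PIf = 6.28318530718f;
static constexpr float kFMScale = 10.0f / k2PIf;
// Hypothetical name: the two multipliers folded into one at compile time.
static constexpr float kFoldedScale = 2.0f * kFMScale;

// Computes (x - 0.5) * (2 * kFMScale) with a single multiply per vector.
void scale_folded(float* data, std::size_t n) {
    assert(n % 4 == 0); // the sketch only handles whole vectors
    const __m128 v05 = _mm_set1_ps(0.5f);
    const __m128 vScale = _mm_set1_ps(kFoldedScale);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_mul_ps(_mm_sub_ps(v, v05), vScale);
        _mm_storeu_ps(data + i, v);
    }
}
```

This removes one `mulps` per iteration regardless of whether the compiler is allowed to reassociate floating-point operations.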

Andrey Semashev · Answer 1 · 2021-09-01T20:03:33.057

0

In general, for best performance you want to make the loop body as concise as possible and move any code invariant to the iteration out of the loop body. Whether a given compiler is able to do it for you depends on its ability to perform this kind of optimization and on the currently enabled optimization level.

Modern gcc and clang versions are typically smart enough to convert _mm_set1_ps and similar intrinsics with constant arguments to in-memory constants, so even without hoisting this results in a fairly efficient binary code. On the other hand, MSVC is not very smart in SIMD optimizations.

As a rule of thumb, I would still recommend moving constant generation (except for all-zero or all-ones constants) out of the loop body. There are several reasons to do so, even if your compiler is capable of doing it on its own:

You are not relying on the compiler to do the optimization, which makes your code more portable across compilers. The code will also perform more similarly between different optimization levels, which may be useful.
Moving the constant out of the loop body may convince the compiler to pre-load the constant before entering the loop instead of referencing it in-memory inside the loop. Again, this is subject to compiler's optimization capabilities and current optimization level.
The constants can be reused in multiple places, including multiple functions. This reduces the binary size and makes your code more cache-friendly at run time. (Some linkers are able to merge equivalent constants when linking the binary, but this feature is not universally supported, it is subject to the optimization level, and in some cases it makes the code non-compliant with the C/C++ standards, which can affect code correctness. For this reason, this feature is normally disabled by default, even when supported.)

If you are going to declare a constant in namespace scope, it is highly recommended to use an aligned array and/or a union to ensure static initialization. Compilers may not be able to perform static initialization if you use intrinsics to initialize the constant. You can use something like the following:

template< typename T >
union mm_constant128
{
    T as_array[sizeof(__m128i) / sizeof(T)];
    __m128i as_m128i;
    __m128 as_m128;
    __m128d as_m128d;

    operator __m128i () const noexcept { return as_m128i; }
    operator __m128 () const noexcept { return as_m128; }
    operator __m128d () const noexcept { return as_m128d; }
};

constexpr mm_constant128< float > k05 = {{ 0.5f, 0.5f, 0.5f, 0.5f }};

void foo()
{
    // Use as follows:
    paramKnob.v = _mm_sub_ps(paramKnob.v, k05);
}
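The aligned-array alternative mentioned above can be sketched as follows. This is a minimal illustration, not the answer's own code: the names `k05Array`, `is_aligned16`, `load_half` and `sub_half` are hypothetical. A plain `constexpr` array is statically initialized, and `_mm_load_ps` performs the aligned load.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics

// A constexpr array of floats needs no runtime initializer.
// alignas(16) makes it safe to use with the aligned load _mm_load_ps.
alignas(16) constexpr float k05Array[4] = {0.5f, 0.5f, 0.5f, 0.5f};

// Hypothetical helper: checks 16-byte alignment of a pointer.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: loads the constant into a vector register.
inline __m128 load_half() {
    assert(is_aligned16(k05Array)); // debug-only sanity check
    return _mm_load_ps(k05Array);
}

// Subtracts 0.5 from every lane of v.
inline __m128 sub_half(__m128 v) {
    return _mm_sub_ps(v, load_half());
}
```

Compared with the union, this avoids reading a union member other than the one that was initialized, at the cost of an explicit load at each use site.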

If you are targeting a specific compiler at a given optimization level, you can inspect the generated assembly code to see whether the compiler is optimizing well enough on its own.

edited Sep 01 '21 at 20:03

answered Sep 01 '21 at 10:10

Andrey Semashev


*The constants can be reused in ... multiple functions* - huh? You almost never want to use a `static __m128i` at global scope, compilers do extra bad with that, but do merge duplicate constants in separate functions at least at link time if not always compile time. [fC - How can I define SIMD variable(s) outside of a function?](https://stackoverflow.com/a/68976281) (Except for some GCC bugs where different vector types make it clone a constant even if you only define it once). – Peter Cordes Sep 01 '21 at 16:11
FWIW: MSVC figured out the constants just fine on godbolt... so I'd keep the compiler slander to a minimum. – Mgetz Sep 01 '21 at 16:35
@PeterCordes > You almost never want to use a `static __m128i` at global scope -- I didn't say `static`. Although `static` (i.e. internal linkage, in C++ terms) is fine as long as the constant is used in a single TU. Constant folding is non-conforming most of the time as it breaks the language guarantee that every object must have a unique address. It would only work if the constant only appears as a temporary, in which case you have to rely on the compiler doing the right thing. – Andrey Semashev Sep 01 '21 at 19:43
@AndreySemashev: non-`static` `__m128i` at global scope is at least as bad, and worse in terms of stopping the compiler from removing it entirely if it e.g. replaces a multiply by 2 with a self-add. You'll still get a runtime initializer to copy from .rodata to .bss, like in my linked answer, unless you do stuff like in your answer with a manual union so you're actually initializing an array. – Peter Cordes Sep 01 '21 at 20:15
Wrapping in a union does fix that initialization problem, although passing the union by value may be less efficient than a raw `__m128` depending on the calling convention. Seems fine on Windows (https://godbolt.org/z/vjGMxxnxo), although if the accessor function didn't inline it would apparently return by hidden pointer instead of in XMM0, even with vectorcall. – Peter Cordes Sep 01 '21 at 20:17
@PeterCordes Obviously, you need to mark the constant with `const` or `constexpr` to avoid runtime initialization in .bss. It's the right thing to do regardless. Re multiply vs. self-add - just write the code the way you want in the first place, i.e. use addition instead of multiplication. Re union passing - you don't pass the union by value, you pass the vector type, either by value or by reference. – Andrey Semashev Sep 01 '21 at 20:34
You'd hope that `const __m128 v = _mm_set1_ps(0.5f);` would be statically initialized in .rodata, but it's not with GCC or MSVC. https://godbolt.org/z/5136fbv3E. It seems clang does manage to constprop it so a function that returns it loads from anonymous .rodata, not the `v` symbol name (https://godbolt.org/z/GMqnf7r8x), and doesn't AFAICT reserve anything in the BSS for it. So that's interesting, but specific to clang. But in that case, it's (exactly?) equivalent to just using the `set1` constant inside each function separately, like you should do with GCC and MSVC. – Peter Cordes Sep 01 '21 at 21:13
This would be a better answer if you showed concrete evidence of missed optimizations that you need to work around with complicated stuff like this union. As @Mgetz says, even recent MSVC is not that bad at things. It used to be unable to hoist `_mm_set` constants out of loops after inlining if they were defined in helper functions, but that's not the case currently, IIRC. – Peter Cordes Sep 01 '21 at 21:16
@PeterCordes As I said in the answer, it can be useful to deduplicate the constants, including across multiple TUs. I admit that I didn't have a concrete example when I mentioned MSVC; I hardly ever use that compiler, but in my past experience MSVC was pretty terrible. It's good that it is getting better. The bottom line is that hoisting the constant out of the loop, including to namespace scope, is never bad; it is definitely good with poor compilers or when optimization is off or when you want to merge constants. I do not see a reason to recommend otherwise. – Andrey Semashev Sep 01 '21 at 22:25
I think modern toolchains do deduplicate constants (even at link time I think), the same way they deduplicate string literals. As I showed, it *is* less good if you manually hoist constants out of functions yourself the obvious way (global `const __m128i`) rather than making an aligned array to load from, or using your complicated union wrapper. – Peter Cordes Sep 01 '21 at 22:43
Unless you check and find your compiler doing a bad job, I'd recommend defining your constants in a place that maximizes readability, often at the top of a loop or next to a shuffle or other operation you're about to do. Using the same vector constant in multiple functions would be a good reason to define it in one place, if changes to one *should* be changes to others, not just coincidence of happening to use the same value. (i.e. compilers are good enough that we can care about maintainability without hurting performance.) – Peter Cordes Sep 01 '21 at 22:44

dave_thenerd · Answer 2 · 2021-11-15T02:59:23.830

-1

Or do this:

Declare a compile-time constant __m128 outside your function, preferably in a "constants" namespace.

 inline constexpr __m128 _val = {0.5f,0.5f,0.5f,0.5f};
 //Your function here

This will save you from using the _mm_set1_ps function, which does not evaluate to 1 instruction, it is a rather inefficient sequence of instructions.

Since templates are resolved at compile-time, there will be little to no performance difference between this and the other answer. However, I think this code is significantly simpler.
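A compilable sketch of this idea, checked against `_mm_set1_ps`, might look like the following. It assumes GCC or Clang, where `__m128` is a native vector type that accepts a brace initializer in a `constexpr` declaration; the names `constants::kHalf`, `same_bits` and `check_constants` are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <xmmintrin.h> // SSE intrinsics

namespace constants {
// Assumption: GCC/Clang vector-extension initializer syntax.
constexpr __m128 kHalf = {0.5f, 0.5f, 0.5f, 0.5f};
}

// Compares two vectors bit for bit.
bool same_bits(__m128 a, __m128 b) {
    return std::memcmp(&a, &b, sizeof(__m128)) == 0;
}

// Debug-only check that the constexpr constant and _mm_set1_ps
// describe the same vector value.
void check_constants() {
    assert(same_bits(constants::kHalf, _mm_set1_ps(0.5f)));
}
```

The check only confirms that both spellings denote the same value; which one produces better machine code has to be judged from the compiler output.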

edited Nov 15 '21 at 02:59

answered Nov 15 '21 at 02:42

dave_thenerd


Where do you do this? Why would this be better? How does it compare to the performance of the two examples given in the question? Can you please [edit] your answer to clarify? – Cody Gray - on strike Nov 15 '21 at 02:44
*which does not evaluate to 1 instruction, it is a rather inefficient sequence of instructions.* - That's not actually true with constant args. `_mm_add_ps(v, _mm_set1_ps(0.5))` compiles to a memory-source `addps` with a constant vector in read-only memory (`.rodata`), on GCC, clang, MSVC, and ICC: https://godbolt.org/z/cxGrPGzer. (GCC can even merge a `constexpr __m128` with a `_mm_set1_ps()` of the same value, so that's fun). Plus, this is quite *inconvenient* for `__m128i` integer vectors; no portable syntax to init them like this. – Peter Cordes Nov 15 '21 at 03:19
Even worse, https://godbolt.org/z/e1h66c4o6 shows that MSVC *compiles* constexpr `__m128i vint = {0x100000001, 0x100000001};`, but to the equivalent of `_mm_set_epi32(0,0,0,1)`, not like GNU C where that initializes the two uint64_t halves of the vector. It also shows that compilers including MSVC de-duplicate `_mm_set1_ps(0.5)` in two separate functions, so that's not a benefit either. I didn't check for link-time duplicate merging across compilation units, in case you use the same vector constant in multiple files (vs. letting them see the same-named `constexpr`). – Peter Cordes Nov 15 '21 at 03:29
Just for fun, https://godbolt.org/z/5zqc345YG shows GCC and clang able to compress the memory constants with AVX-512, using broadcast memory source operands, regardless of how you write them. So at least the `constexpr` declared outside the function isn't *hurting*, unlike if you'd used `static const .. = _mm_set` or something. And clang turns the integer add-one into `pcmpeqd same,same` / `psubd` to subtract -1. (Clang doesn't de-duplicate constants at compile time, it leaves that up to the linker.) – Peter Cordes Nov 15 '21 at 03:34
