In general, for best performance you want to keep the loop body as small as possible and move any loop-invariant code out of it. Whether a given compiler does this for you (an optimization usually called hoisting, or loop-invariant code motion) depends on the compiler's capabilities and on the currently enabled optimization level.
Modern gcc and clang versions are typically smart enough to convert _mm_set1_ps and similar intrinsics with constant arguments into in-memory constants, so even without hoisting the resulting binary code is fairly efficient. MSVC, on the other hand, tends to be less capable at optimizing SIMD code.
As a rule of thumb I would still recommend moving constant generation out of the loop body (except for all-zero or all-ones constants, which compilers can typically materialize in a register with a single instruction). There are several reasons to do so, even if your compiler is capable of doing it on its own:
- You are not relying on the compiler to do the optimization, which makes your code more portable across compilers. The code will also perform more similarly between different optimization levels, which may be useful.
- Moving the constant out of the loop body may convince the compiler to preload the constant into a register before entering the loop instead of referencing it in memory inside the loop. Again, this is subject to the compiler's optimization capabilities and the current optimization level.
- The constants can be reused in multiple places, including multiple functions. This reduces the binary size and makes your code more cache-friendly at run time. (Some linkers are able to merge equivalent constants when linking the binary, but this feature is not universally supported, it is subject to the optimization level, and in some cases it makes the code non-compliant with the C/C++ standards, which can affect code correctness. For this reason, this feature is normally disabled by default, even when supported.)
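As an illustration of the hoisting recommended above, here is a minimal sketch. The function name scale_by_half and the requirement that n be a multiple of 4 are assumptions made for the example, not something taken from the question:

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Hypothetical example: multiply n floats by 0.5 in place.
// Assumes n is a multiple of 4. Unaligned loads and stores are used,
// so data does not have to be 16-byte aligned.
void scale_by_half(float* data, std::size_t n)
{
    assert(n % 4 == 0);

    // Loop-invariant constant, created once before the loop
    // instead of on every iteration.
    const __m128 half = _mm_set1_ps(0.5f);

    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_mul_ps(v, half);
        _mm_storeu_ps(data + i, v);
    }
}
```

With the constant declared outside the loop, the code no longer depends on the compiler noticing that the _mm_set1_ps call is invariant.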
If you are going to declare a constant at namespace scope, it is highly recommended to use an aligned array and/or a union to ensure static initialization. Compilers may not be able to perform static initialization if you use intrinsics to initialize the constant, in which case the constant is initialized dynamically at program startup. You can use something like the following:
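    #include <immintrin.h>

    // A 128-bit constant that is initialized through its first member (the array)
    template< typename T >
    union mm_constant128
    {
        T as_array[sizeof(__m128i) / sizeof(T)];
        __m128i as_m128i;
        __m128 as_m128;
        __m128d as_m128d;

        // Implicit conversions allow passing the constant directly to intrinsics
        operator __m128i () const noexcept { return as_m128i; }
        operator __m128 () const noexcept { return as_m128; }
        operator __m128d () const noexcept { return as_m128d; }
    };

    constexpr mm_constant128< float > k05 = {{ 0.5f, 0.5f, 0.5f, 0.5f }};

    void foo()
    {
        // Use as follows (paramKnob.v is assumed to be an __m128 defined elsewhere):
        _mm_sub_ps(paramKnob.v, k05);
    }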
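The aligned array alternative mentioned above could look roughly like the following sketch. The names k05_array and sub_half are hypothetical, and the code assumes C++11 or later for alignas:

```cpp
#include <immintrin.h>

// Hypothetical constant: a plain aligned array of floats is statically
// initialized, so no initialization code runs at program startup.
alignas(16) constexpr float k05_array[4] = { 0.5f, 0.5f, 0.5f, 0.5f };

// Hypothetical helper that subtracts 0.5 from each of the four lanes.
__m128 sub_half(__m128 x)
{
    // _mm_load_ps requires a 16-byte aligned address,
    // which alignas(16) on the array provides.
    return _mm_sub_ps(x, _mm_load_ps(k05_array));
}
```

Compared to the union, this variant needs an explicit load at each use, but it avoids reading a union member other than the one that was initialized.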
If you are targeting a specific compiler at a given optimization level, you can inspect the generated assembly code to see whether the compiler optimizes well enough on its own.
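For instance, a small test function like the one below can be compiled to an assembly listing. This is only a sketch: the file name and the function add_one are made up for the example, an x86-64 target is assumed, and the command lines in the comment are the usual ways to request assembly output from each compiler.

```cpp
// Hypothetical file name: constant_in_loop.cpp
// Possible ways to produce an assembly listing:
//   g++     -O2 -S constant_in_loop.cpp -o constant_in_loop.s
//   clang++ -O2 -S constant_in_loop.cpp -o constant_in_loop.s
//   cl /O2 /c /FA constant_in_loop.cpp   (MSVC, writes constant_in_loop.asm)
#include <immintrin.h>
#include <cstddef>

// Adds 1.0f to every full group of four floats; a tail of fewer than
// four elements is left untouched to keep the example short.
void add_one(float* data, std::size_t n)
{
    for (std::size_t i = 0; i + 4 <= n; i += 4)
    {
        // The constant is deliberately created inside the loop so that
        // the listing shows what the compiler does with it.
        const __m128 one = _mm_set1_ps(1.0f);
        _mm_storeu_ps(data + i, _mm_add_ps(_mm_loadu_ps(data + i), one));
    }
}
```

In the listing, check whether the instructions that build or load the constant appear inside the loop body or before it, and repeat the check at each optimization level you care about.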