
I can't figure out how to convert 4 x 32-bit signed integers stored in a single __m128i into "unsigned" counterparts. The conversion should saturate: negative numbers get clamped to 0, while non-negative numbers are left unchanged.

E.g.: -100 should turn into 0, while 100 should remain 100.

#include <stdio.h>
#include <cstdint>
#include <emmintrin.h>

int main()
{    
    alignas(16) uint32_t out32u[4];
    __m128i my = _mm_setr_epi32 (100, -200, 0, -500);
    <....missing code....>
    _mm_store_si128(reinterpret_cast<__m128i *>(out32u), my);
    printf("%u %u %u %u\n", out32u[0], out32u[1], out32u[2], out32u[3]);
}

So with the <....missing code....> part filled in, the output of the code above should be:

100 0 0 0

1 Answer


Use the SSE4.1 intrinsic _mm_max_epi32:

my = _mm_max_epi32(my, _mm_setzero_si128());

Or, without SSE4.1, @chtz's elegant a & ~(a >> 31) can be implemented with plain SSE2 as follows:

my = _mm_andnot_si128(_mm_srai_epi32(my, 31), my);
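
For intuition, here is a scalar sketch of what both variants compute per 32-bit lane. The function names are just illustrative, and it assumes the usual x86 compiler behaviour that >> on a negative signed int is an arithmetic shift:

#include <cstdint>

// Per-lane scalar equivalents of the two intrinsic one-liners (sketch only).
int32_t clamp_via_max(int32_t a)   // what _mm_max_epi32(a, _mm_setzero_si128()) does per lane
{
    return a > 0 ? a : 0;
}

int32_t clamp_via_mask(int32_t a)  // what the srai + andnot pair does per lane
{
    int32_t mask = a >> 31;        // arithmetic shift: all-ones if a < 0, else 0
    return a & ~mask;              // negative values become 0, others unchanged
}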

Replace <....missing code....> with either of the intrinsic lines above.
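
Putting it together, a complete version of the asker's program with the SSE2 line filled in would look roughly like this (compile with e.g. g++ -O2; the SSE4.1 _mm_max_epi32 line can be swapped in if you include <smmintrin.h> and build with -msse4.1):

#include <stdio.h>
#include <cstdint>
#include <emmintrin.h>

int main()
{
    alignas(16) uint32_t out32u[4];
    __m128i my = _mm_setr_epi32(100, -200, 0, -500);

    // Clamp negative lanes to 0: the arithmetic shift produces an all-ones
    // mask for negative lanes, and andnot clears exactly those lanes.
    my = _mm_andnot_si128(_mm_srai_epi32(my, 31), my);

    _mm_store_si128(reinterpret_cast<__m128i *>(out32u), my);
    printf("%u %u %u %u\n", out32u[0], out32u[1], out32u[2], out32u[3]);  // prints: 100 0 0 0
}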

Generated assembly for both methods (see the Godbolt link in the comments below).

  • @PeterCordes Thanks for `_mm_setzero_si128`, wasn't aware of it. But was pleasantly surprised that the generated assembly is the same https://gcc.godbolt.org/z/bTGTdv – Maxim Egorushkin Nov 28 '20 at 22:00
  • I was hoping it would fix GCC's braindead choice to copy `my` to xmm1, then xor-zero xmm0, but it doesn't. Not as pleasant as I'd hoped. But yeah, compilers know zeroing idioms, that's what they're for. Just like scalar `return 0` will `xor eax,eax`. You'd also get the same asm from `_mm_set1_epi32(0)`, or from `_mm_set_epi32(0,0,0,0)`. `_mm_setzero_si128()` isn't magic in modern compilers; IDK if ancient compilers used to need help to use xor-zeroing, but they don't now. It's only useful for the semantic meaning for human readers. – Peter Cordes Nov 28 '20 at 22:04
  • @PeterCordes Yep, I noticed sub-optimal `gcc` register allocation in `g` and `h` too. I am tired of reporting [poor register allocation bugs to gcc](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91796). On the other hand, quite surprisingly, for `f` `gcc` allocates registers better than `clang` does. – Maxim Egorushkin Nov 28 '20 at 22:20
  • GCC's register allocator apparently is well-known to be bad at tiny functions, where it has to obey the calling convention's hard-register constraints for both input and output. But it's usually not a problem inside larger functions. (wasted mov instructions are sometimes present even in loops, though). Clang is usually better, but does stumble sometimes, too. – Peter Cordes Nov 28 '20 at 22:25
  • @PeterCordes I heard that argument but in the absence of evidence I am not convinced that `gcc` allocating registers sub-optimally for small functions magically does better on larger ones. The former doesn't imply or cause the latter. `gcc` register allocation faults could, at best, cancel each other out, or, in the worst case, compound. – Maxim Egorushkin Nov 28 '20 at 22:28
  • I've heard this argument from GCC devs, not just people making up guesses to explain observations, so I'm more inclined to believe it. (Although I haven't looked at the code myself to really understand the reasons.) It makes some sense that the reg allocator has special code to handle specific hard-register requirements, so it's certainly plausible it does worse with that than when spilling/reloading for register pressure, or just chaining together sequences of operations and letting dead values die. – Peter Cordes Nov 28 '20 at 22:32
  • @PeterCordes Well, the fact that `clang` often allocates registers better in small functions than `gcc` means that `gcc` does a worse job within the same constraints, which negates the "small function register constraint" or nebulous "specific hard-register requirements" argument. Both compilers obey the same ABI requirements. I may be wrong, but that's what it appears to me based on the above logic. – Maxim Egorushkin Nov 28 '20 at 22:36
  • Nobody's saying it's impossible in general to do well! Just that *GCC's* register allocation algorithm is sub-optimal for that case, moreso than for other cases. (But as we can see, clang also has similar missed optimizations sometimes, even for functions that are trivially simple for humans.) Keep in mind that compiler algorithms need to be fast even for large problem sizes, which often means they can't explore the full range of the problem space to find optimal solutions if there's no efficient-enough algorithm. (Like O(n^2) might be ok, but not exponential.) – Peter Cordes Nov 28 '20 at 22:47
  • @PeterCordes I am only saying that gcc devs' argument lacks any supporting evidence and that's why I cannot trust it. Nothing prevents `gcc`'s register allocator from making different decisions based on function size that result in ideal register allocation. – Maxim Egorushkin Nov 28 '20 at 22:53
  • I don't see your point at all. I think you're countering some argument that it would be impossible to ever fix gcc, but that's not the argument GCC devs are making. Just explaining that GCC's current code isn't great for this case. (With the implication they think it's not important and don't plan to fix it - inline small funcs.) If you look at asm for larger functions, there aren't wasted `mov` instructions all over the place, not to this degree. So it certainly appears that it's just at ABI boundaries that it does worse. (And probably around `asm` statements with hard-reg constraints.) – Peter Cordes Nov 28 '20 at 23:38
  • TL:DR: I don't see any reason not to take the word of GCC devs about the current state of GCC's internals, barring any concrete evidence that contradicts what they say. Of course it could be improved by adding more special cases, if GCC devs would accept such a patch. – Peter Cordes Nov 28 '20 at 23:39