I have two variable bit-shifting code fragments that I want to SSE-vectorize by some means:

1) a = 1 << b (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/2/4/8/16/32/64/128
2) a = 1 << (8 * b) (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/0x100/0x10000/etc

OK, I know that AMD's XOP VPSHLQ would do this, as would AVX2's VPSLLVQ. But my challenge here is whether this can be achieved on 'normal' (i.e. up to SSE4.2) SSE.

So, is there some funky SSE-family opcode sequence that will achieve the effect of either of these code fragments? They need only yield the listed output values for the specific input values (0-7).
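For reference, the two fragments can be pinned down as scalar C. This is only a sketch of the intended behaviour: the function names `shl1_ref` and `shl8_ref` are made up for illustration, and 64-bit values are assumed.

```c
#include <stdint.h>

/* Fragment 1: b in 0..7 -> 1, 2, 4, ..., 128 */
static inline uint64_t shl1_ref(uint64_t b)
{
    return (uint64_t)1 << b;
}

/* Fragment 2: b in 0..7 -> 0x1, 0x100, 0x10000, ..., 0x0100000000000000 */
static inline uint64_t shl8_ref(uint64_t b)
{
    return (uint64_t)1 << (8 * b);
}
```

Any vectorized sequence only has to match these two functions for inputs 0 through 7.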

Update: here's my attempt at 1), based on Peter Cordes' suggestion of using the floating point exponent to do simple variable bitshifting:

#include <stdint.h>
typedef union
{
    int32_t i;
    float f;
} uSpec;
void do_pow2(uint64_t *in_array, uint64_t *out_array, int num_loops)
{
    uSpec u;
    for (int i=0; i<num_loops; i++)
    {
        int32_t x = *(int32_t *)&in_array[i];
        u.i = (127 + x) << 23;
        int32_t r = (int32_t) u.f;
        out_array[i] = r;
    }
}
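The same exponent trick might be written by hand with SSE2 intrinsics along the following lines. This is a sketch rather than a tuned implementation: the names `pow2_epi64_sse2` and `do_pow2_sse2` are hypothetical, an x86 target with SSE2 is assumed, and every input is assumed to be in 0..7. It loads the 64-bit values directly, so no pointer-cast type punning is needed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Each 64-bit lane of b holds a shift count in 0..7 (assumed, not checked).
 * Returns 1 << b in each 64-bit lane. */
static inline __m128i pow2_epi64_sse2(__m128i b)
{
    /* Add the float exponent bias to the low dword of each lane only;
     * the high dword stays 0, which is the bit pattern of 0.0f. */
    const __m128i bias = _mm_set1_epi64x(127);
    __m128i bits = _mm_slli_epi32(_mm_add_epi32(b, bias), 23);
    /* Viewed as floats the dwords are (2^b, 0.0f) per lane; converting
     * back to int32 gives (1 << b, 0), i.e. 1 << b as a 64-bit value. */
    return _mm_cvttps_epi32(_mm_castsi128_ps(bits));
}

/* Hypothetical array wrapper: two 64-bit elements per iteration. */
void do_pow2_sse2(const uint64_t *in_array, uint64_t *out_array, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i b = _mm_loadu_si128((const __m128i *)(in_array + i));
        _mm_storeu_si128((__m128i *)(out_array + i), pow2_epi64_sse2(b));
    }
    for (; i < n; i++)  /* scalar tail when n is odd */
        out_array[i] = (uint64_t)1 << in_array[i];
}
```

This only covers fragment 1): the float-to-int32 conversion cannot represent the larger results of 1 << (8 * b).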
nickpelling
  • Does `b = 0..7 exactly` mean that `b` is *constant*? That seems odd because then there is no operation to do, just use different constants. And is the bit width 64 as you hint at with `VPSHLQ`? – harold Oct 12 '19 at 18:17
  • Do you mean [AVX2 `vpsllvq`](https://www.felixcloutier.com/x86/vpsllvw:vpsllvd:vpsllvq) for per-element-variable shift count? Anyway, you might be able to stuff `b` into the exponent of a `double`, but there's no SIMD packed double->int64 until AVX512. So maybe stuff into `float` and do float->int, then shuffle? – Peter Cordes Oct 12 '19 at 19:02
  • harold: b is not constant, but it is always a variable exactly in the range [0..7]. And yes, I'm using 64-bit width values, but if there are tricks for other bitwidths there may be a way of spoofing from one to the other. :-) – nickpelling Oct 13 '19 at 19:22
  • Peter Cordes: very nice idea indeed, I'll give it a go and see if I can make it work... thanks! :-) – nickpelling Oct 13 '19 at 19:24
  • For 8bit, the first case is easy to do with `pshufb`, and it can be adapted to 16, 32, 64 just by doing an OR first (to set the high bit of every byte that should come out as zero) – harold Oct 13 '19 at 21:49
  • harold: also a nice idea! As I understand it, you're suggesting using pshufb(0x8040201008040201ULL,variableshift) & 0xFF, is that right? – nickpelling Oct 18 '19 at 08:35
  • Peter Cordes: here's my first attempt at using the floating point exponent to do simple variable bit shifting: – nickpelling Oct 18 '19 at 16:30
  • Ok, does that auto-vectorize? Also, type-punning via pointer casting violates strict aliasing (and is UB). Did you get less efficient asm when you used `int32_t x = in_array[i]` like a normal person? – Peter Cordes Oct 18 '19 at 21:04
  • Peter Cordes: it does indeed autovectorize (I put it into godbolt.org). I wanted to go from a uint64_t to a uint64_t, which is why the code dances around a little. But because it uses floats, normal SSE ops do four of these at a time, before expanding out into uint64_t vars, which is a nice bonus. :-) – nickpelling Oct 19 '19 at 20:53
  • Having said that, harold's suggestion (using pshufb) would be able to map 16 [0..7] byte values to 1<<[0..7] values in a single op, though how on earth I could get gcc to autovectorize that is quite beyond me. If I was using intrinsics, that would be the clear winner, though. :-/ – nickpelling Oct 19 '19 at 20:58
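With intrinsics, harold's `pshufb` suggestion for 64-bit lanes might be sketched as below. All function names and the `SSSE3_FN` macro are hypothetical, the `target` attribute assumes GCC or Clang, and inputs are assumed to be in 0..7. The second function is not from the thread: it is one possible way to get fragment 2) by broadcasting `b` across its lane and comparing against each byte's position.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Assumption: GCC/Clang. Enables SSSE3 per function when -mssse3 is not set. */
#if defined(__GNUC__) && !defined(__SSSE3__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Fragment 1: 1 << b per 64-bit lane, via a 16-byte lookup table. */
SSSE3_FN static __m128i shl1_epi64_pshufb(__m128i b)
{
    const __m128i lut = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, (char)0x80,
                                      0, 0, 0, 0, 0, 0, 0, 0);
    /* OR 0xFF into the upper 7 bytes of each lane: pshufb writes zero
     * wherever the index byte has its high bit set. */
    const __m128i zero_upper = _mm_set1_epi64x(~0xFFLL);
    return _mm_shuffle_epi8(lut, _mm_or_si128(b, zero_upper));
}

/* Fragment 2: 1 << (8 * b) per 64-bit lane, i.e. byte number b becomes 1. */
SSSE3_FN static __m128i shl8_epi64_pshufb(__m128i b)
{
    /* Copy the low byte of each lane into all 8 bytes of that lane. */
    const __m128i bcast = _mm_setr_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                                        8, 8, 8, 8, 8, 8, 8, 8);
    const __m128i pos = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      0, 1, 2, 3, 4, 5, 6, 7);
    __m128i hit = _mm_cmpeq_epi8(_mm_shuffle_epi8(b, bcast), pos);
    return _mm_and_si128(hit, _mm_set1_epi8(1));  /* 0xFF -> 0x01 */
}

/* Hypothetical array wrappers: two elements per iteration, scalar tail. */
SSSE3_FN void shl1_pshufb(const uint64_t *in, uint64_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i b = _mm_loadu_si128((const __m128i *)(in + i));
        _mm_storeu_si128((__m128i *)(out + i), shl1_epi64_pshufb(b));
    }
    for (; i < n; i++)
        out[i] = (uint64_t)1 << in[i];
}

SSSE3_FN void shl8_pshufb(const uint64_t *in, uint64_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i b = _mm_loadu_si128((const __m128i *)(in + i));
        _mm_storeu_si128((__m128i *)(out + i), shl8_epi64_pshufb(b));
    }
    for (; i < n; i++)
        out[i] = (uint64_t)1 << (8 * in[i]);
}
```

With 64-bit elements each `pshufb` only handles two values; the 16-at-a-time mapping mentioned above would need the shift counts packed as bytes.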

0 Answers