Loading 128 bits of mixed float+int data?

Question

I have a struct which has the following composition:

static constexpr uint64_t emptyStructValue { 0 };

union MyStruct {
    explicit MyStruct(uint64_t comp) : composite(comp){}

    struct{
        int16_t a;    
        bool b;
        bool c;
        float d;
    };

    uint64_t composite = 0; 

    bool hasValue(){
        return composite != emptyStructValue;
    }
};

and I have two of these structs in another object:

class B{
    MyStruct s1;
    MyStruct s2;
};

and I would like to know, given an object of type B, how I could load all 128 bits into an SSE register and check whether any bit is set?

I found _mm_loadu_si128() but my data has a mixture of ints and floats?

You just want to check if at least one bit is set? Just to make sure, what do you expect when doing this on floats? For ints (and two's complement), it's the same as checking if it is not 0, but floats are more complicated => What exactly is the purpose? — deviantfan, Jun 07 '15 at 13:06
If there is nothing in the struct, I expect all 64 bits to be 0. I just want to see if a struct contains an entry, quickly, hence I just want to see if any bits are set? The struct has a union, which sets the 64 bits to zero if it's empty. — user997112, Jun 07 '15 at 13:07
Instances of your struct always contain all 4 variables with some values? It can't be empty. edit: And the union stuff doesn't make it empty either. — deviantfan, Jun 07 '15 at 13:08
I have shown the struct in full. I basically want to be able to load two structs, using one memory load, to check whether at least one is populated with a non-zero for the "composite"? — user997112, Jun 07 '15 at 13:12
"All 64 bits zero" - you're silently assuming that `bool` has 7 padding bits, all of which are zero. — MSalters, Jun 07 '15 at 13:15
I just remembered that (one of) the IEEE754 0 representations is really the 0-only bit mask. So, even leaving the bool problem (and the possibility of different variable sizes) aside, you can't distinguish between "empty" and "all 4 variables are 0". — deviantfan, Jun 07 '15 at 13:18
@TonyK each struct is 64 bits, I want to load 128 bits, ie two structs stored contiguously. — user997112, Jun 07 '15 at 13:18
@deviantfan As I understand it, if I set the struct equal to emptyStructValue, all 64-bits would be zero, due to the union with uint64_t composite? — user997112, Jun 07 '15 at 13:20
@user997112 About the bool issue: It's not guaranteed what numeric byte values "true" and "false" have as long as they are stored in the bool (only conversions to integers are exactly 1 and 0) — deviantfan, Jun 07 '15 at 13:20
@user997112 About the zero thing: Yes (well, not even that is guaranteed in theory. Doesn't matter for now). But all 64 bits zero could mean "empty" *or* "not empty, a is 0, b is 0, c is 0, d is 0". An empty-flag has to be separate if you don't want this. — deviantfan, Jun 07 '15 at 13:21
@user997112: Yes. Trivially, there are compilers which have `sizeof(bool)>1` and secondly, there's no guarantee at all about the value of those padding bits. They're quite literally non-value bits. — MSalters, Jun 07 '15 at 13:28
If I set the struct to empty, using emptyStructValue , composite uint64_t = 0 means all bits MUST be zero? — user997112, Jun 07 '15 at 13:36
@user997112 Usually yes, all bits of composite will be 0 then. But technically not even this is guaranteed, and "all bits of composite" isn't necessarily the same as "all bits of the union" (if you think so). And even if it were, it just makes no sense. Just use a separate empty flag (or tell us what the purpose of this whole thing is, because there might be a better solution) — deviantfan, Jun 07 '15 at 13:51
You should make your `bool`s `int8_t`s so we have a guarantee of size and value. But, leaving that aside: Yes, you can load two 64-bit unions like this, whether they be a mix of integer data and float or not; And since you have AVX, you're likely to also have SSE4.1, in which case the `pcmpeqq` instruction (accessible with `__m128i _mm_cmpeq_epi64(__m128i, __m128i)`) is exactly what you want. — Iwillnotexist Idonotexist, Jun 07 '15 at 14:15
@IwillnotexistIdonotexist: Use `ptest` of a register with itself to check if any of the bits are non-zero. Just like the `test` instruction: bitwise AND, and set flags. SSE4.1. And yes, AVX implies all the SSE instruction sets. — Peter Cordes, Jul 01 '15 at 20:33

TonyK · Answer 1 · 2015-06-08T13:12:54.487

3

In practical terms, if sizeof(B) == 2*sizeof(uint64_t), then I see no reason not to do what you suggest. But if speed is important (and it looks like it is), you should align your B object to a 128-bit boundary, so that you can use _mm_load_si128 instead of _mm_loadu_si128.
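A minimal sketch of the aligned load described above, using only SSE2 intrinsics. `AlignedB` and `anyBitSet` are hypothetical names, and `AlignedB` is a stand-in holding two plain 64-bit values rather than the original unions. Note that alignment specifiers count bytes, so a 128-bit boundary is `alignas(16)` (or `__declspec(align(16))` on MSVC).

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_cmpeq_epi8, _mm_movemask_epi8

// Hypothetical stand-in for B: two 64-bit composites on a 16-byte (128-bit) boundary.
struct alignas(16) AlignedB {
    std::uint64_t s1 = 0;
    std::uint64_t s2 = 0;
};
static_assert(sizeof(AlignedB) == 2 * sizeof(std::uint64_t),
              "the object must be exactly 128 bits");

// Returns true if any of the 128 bits is set.
inline bool anyBitSet(const AlignedB& b) {
    // Aligned integer load; it copies the raw bits, so float fields are not an issue.
    const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(&b));
    // Compare each byte with zero and collect one mask bit per byte.
    const int zeroBytes = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
    // All 16 mask bits set means every byte was zero.
    return zeroBytes != 0xFFFF;
}
```

The `static_assert` makes the size condition from the answer a compile-time check instead of an assumption.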

Edited to add: In fact, in 64-bit mode, it's probably faster just to use the regular opcodes. Something like:

mov   rax,[rsi]
or    rax,[rsi+8]
jnz   BitSet

Even in 32-bit mode, it might turn out to be faster. You will have to experiment.
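For comparison, the same scalar idea can be written in portable C++. This is only a sketch: `anyBitSetScalar` is a hypothetical helper, and whether a given compiler reduces it to a `mov`/`or` pair like the assembly above is something you would have to check in the generated code.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: p must point at 16 readable bytes (two 64-bit composites).
inline bool anyBitSetScalar(const void* p) {
    std::uint64_t lo;
    std::uint64_t hi;
    // memcpy reads the raw bytes without aliasing or alignment problems.
    std::memcpy(&lo, p, sizeof lo);
    std::memcpy(&hi, static_cast<const unsigned char*>(p) + sizeof lo, sizeof hi);
    // OR the two halves together: the result is non-zero if any bit is set in either.
    return (lo | hi) != 0;
}
```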

edited Jun 08 '15 at 13:12

answered Jun 07 '15 at 14:02

TonyK


Created B on the stack, aligned to 128 bits. Then did __declspec(align(128)) B b; B* bptr = &b; __m128i intrinreg = _mm_load_si128(reinterpret_cast<__m128i*>(bptr)); but unsure how to compare the __m128i with the number zero? – user997112 Jun 07 '15 at 15:29
@user997112: Hmmm, good question. I can't see a way to do it. Perhaps the best way is just `if (s1.composite == 0 && s2.composite == 0)`. – TonyK Jun 07 '15 at 15:49
@user997112 [There's an SO question about just that](http://stackoverflow.com/questions/10175711/check-xmm-register-for-all-zeroes), if `pcmpeqq` isn't good enough for you. – Iwillnotexist Idonotexist Jun 08 '15 at 04:55
@IwillnotexistIdonotexist: `pcmpeqq` doesn't set any flags, it just sets the destination SSE register accordingly. Which leaves you back where you started. – TonyK Jun 08 '15 at 07:53
`pcmpeqq` is required to test the 64-bit integers individually for equality. Of course, after that follows `pmovmskb` to extract the comparison results; this latter instruction will return `0x0000` if neither comparison to zero was true, `0xFF00` or `0x00FF` if one of them was, and `0xFFFF` if both were. – Iwillnotexist Idonotexist Jun 08 '15 at 11:51
@IwillnotexistIdonotexist: That's my point! All that copying and comparing is just what we were trying to avoid in the first place. – TonyK Jun 08 '15 at 12:00
@TonyK But that's the absolute bare minimum work you need to do... if you had instead done it on the scalar side, you'd need 1 load, 1 compare and 1 branch for _each_ `union`, whereas here we just need 1 load, 1 compare, 1 of either `pmovmskb` or `ptest` and 1 branch per _2_ `union`s. If you have AVX2 then that's per _4_ `union`s. Can't get much better than that... – Iwillnotexist Idonotexist Jun 08 '15 at 12:07
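The two approaches from these comments (`pcmpeqq` followed by `pmovmskb`, and `ptest` of a register with itself) could be sketched with intrinsics roughly as follows. `zeroLaneMask` and `anyBitSetPtest` are hypothetical names. The SSE4.1 paths are only compiled when the compiler defines `__SSE4_1__` (for example with `-msse4.1` on GCC or Clang); otherwise an SSE2 fallback that I added for portability is used.

```cpp
#include <cstdint>
#include <emmintrin.h>   // SSE2
#if defined(__SSE4_1__)
#include <smmintrin.h>   // SSE4.1: _mm_cmpeq_epi64, _mm_testz_si128
#endif

// Byte mask of which 64-bit halves are zero:
// 0x00FF = first is zero, 0xFF00 = second is zero, 0xFFFF = both, 0x0000 = neither.
inline int zeroLaneMask(const void* p) {
    const __m128i v = _mm_loadu_si128(static_cast<const __m128i*>(p));
    const __m128i zero = _mm_setzero_si128();
#if defined(__SSE4_1__)
    const __m128i eq = _mm_cmpeq_epi64(v, zero);            // pcmpeqq
#else
    // SSE2 fallback: compare 32-bit halves, then AND each with its neighbour
    // so a 64-bit lane is all-ones only if both of its halves were zero.
    const __m128i eq32 = _mm_cmpeq_epi32(v, zero);
    const __m128i swapped = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));
    const __m128i eq = _mm_and_si128(eq32, swapped);
#endif
    return _mm_movemask_epi8(eq);                           // pmovmskb
}

// True if any of the 128 bits is set.
inline bool anyBitSetPtest(const void* p) {
#if defined(__SSE4_1__)
    const __m128i v = _mm_loadu_si128(static_cast<const __m128i*>(p));
    return !_mm_testz_si128(v, v);                          // ptest v, v
#else
    return zeroLaneMask(p) != 0xFFFF;
#endif
}
```

If only the "is anything set" answer is needed, the `ptest` form avoids the separate compare; the mask form is useful when you also want to know which of the two unions is populated.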
