Doing 64bit addition in HLSL, why is one of my implementations producing incorrect results?

Question

I have 2 different implementations of a 64bit add in HLSL. If I want to set A += B, where al, ah, bl, and bh are the low and high 32 bits of A and B respectively, then I do either

(1):

#define pluseq64(al, ah, bl, bh) do {\
    uint tadd0 = al >> 1;\
    uint tadd1 = bl >> 1;\
    tadd0 += al & bl & 0x00000001;\
    tadd0 += tadd1;\
    tadd0 >>= 31;\
    al += bl;\
    ah += bh;\
    ah += tadd0; } while(0)

or (2):

#define pluseq64(al, ah, bl, bh) do {\
    uint t = al;\
    al += bl;\
    ah += bh;\
    if (al < t) { \
        ah += 1; \
    } } while(0)

Now, interestingly enough, (1) always produces the correct output, whereas (2) does not. Given that (1) is kind of a mess of operations (3 shifts, 5 adds to do a single 64bit +=), I'd much prefer something along the lines of (2) to (1), except that (2) doesn't work properly.

As an alternative to (2), I've tried:

#define pluseq64(al, ah, bl, bh) do {\
    uint t = al;\
    al += bl;\
    ah += bh;\
    ah += (al < t); } while(0)

This doesn't quite work either (likely for the same reason, whatever that reason is, if I had to guess).

Why doesn't (2) work properly? Bonus: is there a better way to do a 64bit add in HLSL?

Thank you!

It may take me a bit of time to find exactly where it deviates (no breakpoint ability inside of the GPU, blah). Interestingly, when I put in some test cases where the carry bit would be present, both versions produced the correct output... but the issue remains that the cumulative output is different when I use version 1 as opposed to version 2. It's just so bizarre. — MNagy, Jul 27 '15 at 23:39

score 1 · Answer 1 · answered Jul 28 '15 at 06:15

In my testing, the three seem to produce equivalent output in C++, so this is kind of odd. Did you do CPU-side testing, and did it work for you there? One thing you could try is to skip the macro and do/while stuff and see if it works with a simple HLSL function:

void pluseq64(inout uint al, inout uint ah, in uint bl, in uint bh)
{
    uint t = al;
    al += bl;
    ah += bh;
    if (al < t)
    {
        ah += 1;
    }
    // or, instead of the if block: ah += uint(al < t);
}

Functions are inlined in HLSL anyway so I don't think you gain anything from using preprocessor directives.
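
One thing a function rules out that a macro cannot is name capture. As an illustration only (the question does not show the actual call sites, so this is an assumption and not a diagnosis), the C++ sketch below ports variant (2) as a macro and invokes it from a call site whose low word of B happens to be named t, the same name the macro uses for its temporary. The macro's local t shadows the caller's variable, so the wrong value is added, while the function version is unaffected. As far as I know, HLSL's preprocessor performs the same C-style textual substitution, so the same hazard would apply there; variant (1) uses the less common names tadd0 and tadd1, which would make a collision less likely. PLUSEQ64, add_with_macro and add_with_function are hypothetical names made up for this sketch.

```cpp
#include <cstdint>

// Variant (2) from the question, ported to C++ as a macro.
#define PLUSEQ64(al, ah, bl, bh) do { \
    uint32_t t = (al);                \
    (al) += (bl);                     \
    (ah) += (bh);                     \
    if ((al) < t) { (ah) += 1; }      \
} while (0)

// Same logic as a function: its local names cannot collide with the caller's.
inline void pluseq64(uint32_t& al, uint32_t& ah, uint32_t bl, uint32_t bh) {
    uint32_t old_low = al;
    al += bl;
    ah += bh;
    if (al < old_low) { ah += 1; }
}

// Hypothetical call site: the low word of B is named `t`.
inline uint64_t add_with_macro(uint32_t lo, uint32_t hi, uint32_t t, uint32_t u) {
    // After expansion, `(lo) += (t)` reads the macro's own `t` (a copy of lo),
    // not the parameter `t`, so lo is doubled instead of incremented by B.
    PLUSEQ64(lo, hi, t, u);
    return (uint64_t(hi) << 32) | lo;
}

// Same call site using the function: the parameter `t` is passed by value as expected.
inline uint64_t add_with_function(uint32_t lo, uint32_t hi, uint32_t t, uint32_t u) {
    pluseq64(lo, hi, t, u);
    return (uint64_t(hi) << 32) | lo;
}
```

With lo = 5 and t = 7, add_with_function returns 12 while add_with_macro returns 10. A compiler run with shadowing warnings enabled would flag the macro case, which is one way to check whether something like this is happening in a given codebase.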

score 0 · Answer 2 · answered Dec 03 '21 at 04:36

Perhaps your snippet hit an older driver bug? Stepping through the disassembly with PIX could help. I've used the following without issue on Nvidia/AMD/Intel; it is basically equivalent to your (2), since the carry comes from comparing the low word of the sum against one of the inputs.

struct uint64_emulated
{
    uint32_t low;
    uint32_t high;
};

inline uint64_emulated Add(uint64_emulated a, uint64_emulated b)
{
    uint64_emulated c;
    c.low = a.low + b.low;
    c.high = a.high + b.high + (c.low < a.low); // Add with carry.
    return c;
}
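
For the CPU-side cross-check suggested in the first answer, a minimal C++ sketch could look like the following. It ports the comparison-based carry from the code above and the shift-based carry from variant (1) in the question, then compares both against native uint64_t addition on a few edge values and on a running sum of pseudo-random inputs. The names U64Emulated, add_compare, add_shift and count_mismatches are made up for this sketch, and passing on the CPU says nothing definite about what a particular GPU compiler or driver does with the HLSL version.

```cpp
#include <cstdint>

struct U64Emulated { uint32_t low; uint32_t high; };

// Carry from comparing the low sum against an input (the approach above).
inline U64Emulated add_compare(U64Emulated a, U64Emulated b) {
    U64Emulated c;
    c.low = a.low + b.low;                       // wraps modulo 2^32
    c.high = a.high + b.high + (c.low < a.low ? 1u : 0u);
    return c;
}

// Carry from halved operands, as in variant (1) of the question.
inline U64Emulated add_shift(U64Emulated a, U64Emulated b) {
    uint32_t carry = (a.low >> 1) + (b.low >> 1) + (a.low & b.low & 1u);
    carry >>= 31;                                // top bit is the carry out of bit 31
    U64Emulated c;
    c.low = a.low + b.low;
    c.high = a.high + b.high + carry;
    return c;
}

inline uint64_t to_native(U64Emulated v) { return (uint64_t(v.high) << 32) | v.low; }
inline U64Emulated from_native(uint64_t v) { return { uint32_t(v), uint32_t(v >> 32) }; }

// Counts disagreements with native 64-bit addition.
inline int count_mismatches(uint32_t iterations) {
    const uint64_t edges[] = { 0ull, 1ull, 0x7FFFFFFFull, 0x80000000ull,
                               0xFFFFFFFFull, 0x100000000ull, 0xFFFFFFFFFFFFFFFFull };
    int mismatches = 0;
    for (uint64_t a : edges) {
        for (uint64_t b : edges) {
            uint64_t expected = a + b;
            if (to_native(add_compare(from_native(a), from_native(b))) != expected) ++mismatches;
            if (to_native(add_shift(from_native(a), from_native(b))) != expected) ++mismatches;
        }
    }
    // Running sums, to mirror the "cumulative output" case from the comment.
    uint64_t state = 0x9E3779B97F4A7C15ull;      // xorshift64 seed, arbitrary nonzero value
    uint64_t native_sum = 0;
    U64Emulated compare_sum = { 0u, 0u };
    U64Emulated shift_sum = { 0u, 0u };
    for (uint32_t i = 0; i < iterations; ++i) {
        state ^= state << 13; state ^= state >> 7; state ^= state << 17;
        native_sum += state;
        compare_sum = add_compare(compare_sum, from_native(state));
        shift_sum = add_shift(shift_sum, from_native(state));
        if (to_native(compare_sum) != native_sum) ++mismatches;
        if (to_native(shift_sum) != native_sum) ++mismatches;
    }
    return mismatches;
}
```

If count_mismatches reports zero here but the shader results still diverge, that would point away from the arithmetic itself and toward how the code is expanded or compiled on the GPU side.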
