Undefined high-order of uint64_t while shifting and masking 32-bit values

Question

I have some undefined behaviour in a seemingly innocuous function which is parsing a double value from a buffer. I read the double in two halves, because I am reasonably certain the language standard says that shifting char values is only valid in a 32-bit context.

inline double ReadLittleEndianDouble( const unsigned char *buf )
{
    uint64_t lo = (buf[3] << 24) | (buf[2] << 16) | (buf[1] << 8) | buf[0];
    uint64_t hi = (buf[7] << 24) | (buf[6] << 16) | (buf[5] << 8) | buf[4];
    uint64_t val = (hi << 32) | lo;
    return *(double*)&val;
}

Since I am storing 32-bit values into 64-bit variables lo and hi, I reasonably expect that the high-order 32-bits of these variables will always be 0x00000000. But sometimes they contain 0xffffffff or other non-zero rubbish.

The fix is to mask it like this:

uint64_t val = ((hi & 0xffffffffULL) << 32) | (lo & 0xffffffffULL);

Alternatively, it seems to work if I mask during the assignment instead:

uint64_t lo = ((buf[3] << 24) | (buf[2] << 16) | (buf[1] << 8) | buf[0]) & 0xffffffff;
uint64_t hi = ((buf[7] << 24) | (buf[6] << 16) | (buf[5] << 8) | buf[4]) & 0xffffffff;

I would like to know why this is necessary. All I can think of to explain this is that my compiler is doing all the shifting and combining for lo and hi directly on 64-bit registers, and I might expect undefined behaviour in the high-order 32-bits if this is the case.

Can someone please confirm my suspicions or otherwise explain what is happening here, and comment on which (if any) of my two solutions is preferable?

Keith Thompson's comment on a deleted answer explained the problem: The `<<` operator is not applied to a `char` operand. The *integer promotions* are applied to the operands, so `char` is promoted to `int` (or conceivably to `unsigned int` on a really bizarre implementation). The right operand (at most 24) is not "greater or equal to the number of bits in the promoted left operand"; the problem is that `foo << 24` can overflow. Casting to `unsigned` should suffice if `int` and `unsigned int` are at least 32 bits wide. Casting to `uint64_t` is probably cleaner. — Tony Delroy, Sep 18 '14 at 01:31
I think @TonyD is correct. Also, depending on the compiler and CPU, some version may compile down to `bswap` and others might not. I couldn't speculate on what the compiler emits though. — Jason, Sep 18 '14 at 01:54
@TonyD that explains why I should explicitly cast before shifting, but doesn't explain why the high-order of the `uint64_t` could come out undefined. On my system, `int` is 32-bits wide, so I don't have overflow (but I will make it explicit). — paddy, Sep 18 '14 at 02:14
They are talking about signed overflow since they result will be unsigned and then converted to signed. — Shafik Yaghmour, Sep 18 '14 at 02:22
One of them should have just wrote an answer or at least did a CW answer, I don't have time now but may before I head off to sleep. — Shafik Yaghmour, Sep 18 '14 at 02:44
aside from the shifting issues, `*(double *)&val` violates the strict aliasing rule. To conform you need to use `memcpy` or a union. — M.M, Feb 23 '16 at 07:37
@paddy `(buf[7] << 24)` may produce a negative number if `buf[7] > CHAR_MAX` (technically undefined behaviour but a common implementation is to produce a negative `int`), so when you assign it to a `uint64_t` it is made positive module 2^64, resulting in FFFF.... on the left. Your "Alternatively" solution still has the undefined behaviour — M.M, Feb 23 '16 at 07:40

score 3 · Accepted Answer · answered Sep 18 '14 at 02:48

If you try to shift a char or unsigned char you're leaving yourself at the mercy of the standard integer promotions. You're better off casting the values yourself, before you try to shift them. You don't have to separate the lower and upper halves if you do so.

inline double ReadLittleEndianDouble( const unsigned char *buf )
{
    uint64_t val = ((uint64_t)buf[7] << 56) | ((uint64_t)buf[6] << 48) | ((uint64_t)buf[5] << 40) | ((uint64_t)buf[4] << 32) |
                   ((uint64_t)buf[3] << 24) | ((uint64_t)buf[2] << 16) | ((uint64_t)buf[1] << 8) | (uint64_t)buf[0];
    return *(double*)&val;
}

All this is necessary only if the CPU is big-endian or if the buffer might not be properly aligned for the CPU architecture, otherwise you can simplify this greatly:

    return *(double*)buf;

`*(double *)buf` violates the strict aliasing rule – M.M Feb 23 '16 at 07:36 — M.M, Feb 23 '16 at 07:36

Undefined high-order of uint64_t while shifting and masking 32-bit values

1 Answers1

Linked

Related