
So, like the title says, I want to be able to convert between bytes loaded into memory as char* and unsigned integers. I have a program that demos some functions that seem to do this, but I am unsure if it is fully compliant with the C++ standard. Is all the casting I am doing legal and well defined? Am I handling sign extension, masking, and truncation correctly? I plan to eventually deploy this code to a variety of different platforms, sometimes with drastically different architectures, and everything I have tried so far seems to imply that this is valid cross-platform code to serialize and deserialize my data, but I am more interested in what the standard says than in whether or not it works on my particular machines. Here's the small test program to demo the conversion functions:

#include <type_traits>
#include <iostream>
#include <iomanip>

template<typename IntType>
IntType toUint( char byte ) {
    static_assert( std::is_integral_v<IntType>, "IntType must be an integral" );
    static_assert( std::is_unsigned_v<IntType>, "IntType must be unsigned" );
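    // if char is signed and a negative byte is cast to the wider unsigned
    // type, the value wraps modulo 2^N (sign extension on two's-complement
    // platforms), so mask down to the low 8 bits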
    return static_cast<IntType>( byte ) & 0xFF;
}

template<typename IntType>
void printAs( signed char* cString, const int arraySize )
{
    std::cout << "Values: [" << std::endl;
    for( int i = 0; i < arraySize; i++ )
    {
        std::cout << std::dec << std::setfill('0') << 
            std::setw(3) << toUint<IntType>( cString[i] ) << 
            ": " << "0x" << std::uppercase << std::setfill('0') << 
            std::setw(16) << std::hex << toUint<IntType>( cString[i] );
        if(i < (arraySize - 1) )
        {
            std::cout << ", ";
            std::cout << std::endl;
        }
    }
    std::cout << std::endl << "]" << std::endl;
}
template<typename IntType>
IntType cStringToUint( signed char* cString, const int arraySize )
{
    IntType myValue = 0;
    for( int i = 0; i < arraySize; i++ )
    {
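        // big-endian accumulation: shift previously read bytes up and OR
        // in the next byte; once IntType is full, the oldest byte shifts
        // out the top (well defined for unsigned types)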
        myValue <<= 8;
        myValue |= toUint<IntType>( cString[i] );
    }
    return myValue;
}

template<typename IntType>
void printAsHex( IntType myValue )
{
    std::cout << "0x" << std::uppercase << std::setfill('0') << 
        std::setw(16) << std::hex << myValue <<std::endl;
}

int main()
{
    const int arraySize = 9;
    // assume Big Endian
    signed char cString[arraySize] = {-1,2,4,8,16,-32,64,127,-128};
    // convert each byte to a uint and print the value
    printAs<uint64_t>( cString, arraySize );
    // notice this trims leading MSB
    printAsHex( cStringToUint<uint64_t>( cString, arraySize ) );
}

Which gives the following output with my compiler:

Values: [
255: 0x00000000000000FF,
002: 0x0000000000000002,
004: 0x0000000000000004,
008: 0x0000000000000008,
016: 0x0000000000000010,
224: 0x00000000000000E0,
064: 0x0000000000000040,
127: 0x000000000000007F,
128: 0x0000000000000080
]
0x02040810E0407F80

So, is this well defined and specified? Can I rest assured that I will get this output every time? I've tried to be thorough, but I would appreciate a second opinion at least, or preferably a citation from the standard covering how casting from char to an unsigned integer and promoting to a wider type behaves, along with the sign-extension rules, if it is indeed well defined and specified. I really don't want to have to reach for Boost just to do this in a cross-platform way.
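
To show concretely what I mean by the sign-extension concern, here is a minimal standalone snippet (my own illustration, assuming a two's-complement platform with an 8-bit signed char) of why I mask with & 0xFF after the cast:

#include <cstdint>
#include <iostream>

int main()
{
    signed char byte = -1;
    // converting a negative signed char to a wider unsigned type is
    // defined as conversion modulo 2^64, which on two's-complement
    // machines looks like sign extension: all high bits end up set
    uint64_t unmasked = static_cast<uint64_t>( byte );
    // masking keeps only the low 8 bits, the byte value I actually want
    uint64_t masked = static_cast<uint64_t>( byte ) & 0xFF;
    std::cout << std::hex << unmasked << std::endl; // ffffffffffffffff
    std::cout << std::hex << masked << std::endl;   // ff
}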

Also, feel free to assume that I will always be casting to a type of the same or wider width with this. Narrowing casts seem tricky, so I'm just ignoring them for now (I will probably eventually implement some kind of truncation, similar to the `static_cast<IntType>( byte ) & 0xFF;` in this code, depending on the widths of the input and desired types).
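
For the record, the eventual narrowing helper I have in mind would look something like this rough sketch (`toNarrower` is a hypothetical name, not part of the program above); as I understand it, for unsigned destination types the conversion itself is already defined as modulo 2^N, so an explicit mask would be redundant:

#include <type_traits>

// hypothetical helper: truncate a wider unsigned value to a narrower
// unsigned type, mirroring the "& 0xFF" idiom used for single bytes above
template<typename NarrowType, typename WideType>
NarrowType toNarrower( WideType value ) {
    static_assert( std::is_unsigned_v<NarrowType> && std::is_unsigned_v<WideType>,
                   "both types must be unsigned" );
    static_assert( sizeof(NarrowType) <= sizeof(WideType),
                   "NarrowType must not be wider than WideType" );
    // conversion to a narrower unsigned type is defined as modulo 2^N,
    // so the cast alone already performs the masking
    return static_cast<NarrowType>( value );
}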

  • You will get a consistent `uint64_t`, but that `uint64_t` will only be affected by the last 8 characters in your array. That's how many bytes are in a `uint64_t`. – Drew Dormann Jul 14 '22 at 16:42
  • @Drew Dormann Thank you; that is desirable in my case, and the truncation is intentional, to make this somewhat robust against overflow errors. If you examine the comments, trimming to just the 8 LSB is expected and desired here. – alrav Jul 14 '22 at 16:49
  • Don't use `char` as that can be signed or unsigned. Use `std::byte` or `unsigned char`. Your code is UB/implementation-defined behavior before C++20. – Goswin von Brederlow Jul 14 '22 at 23:38
  • I am using --std=c++2a with gcc 9, so hopefully this won't be an issue, but I am curious: where does the UB get invoked if I wanted to port this to an older compiler version that lacks that support? Is it something to do with the possible sign conversion when performing `static_cast<IntType>( byte )`? Is that fixed in C++20 somehow because it guarantees a 2's complement representation? I would like to support char as a lot of APIs support it, like fstream and memcpy. I actually use uint8_t for internal code. – alrav Jul 15 '22 at 15:02
  • @GoswinvonBrederlow does replacing `return static_cast<IntType>( byte ) & 0xFF;` with `if constexpr ( std::is_unsigned_v<CharType> ) { rtnVal = static_cast<IntType>( byte ) & 0xFF; } else { rtnVal = static_cast<IntType>( std::bit_cast<std::make_unsigned_t<CharType>>( byte ) ) & 0xFF; }` fix it? Full code example in godbolt because reply formatting is annoying: https://godbolt.org/z/qb19a4n8G – alrav Jul 15 '22 at 19:35
  • What if CharType is a wide char? You would only use half the bytes. You should choose ONE type for your IO buffers and stick with that. – Goswin von Brederlow Jul 15 '22 at 21:18
  • Ah, that is a good point, though I added a check for the CharType size using CHAR_BIT (see the sketch after these comments). And I would love to just use uint8_t for everything, but sadly many APIs and legacy systems take things as char* and void* as well, and I want this to interoperate with them smoothly. – alrav Dec 09 '22 at 17:23
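
For completeness, here is a rough sketch of the direction those comments point in, assuming C++20 for std::bit_cast; the CharType template parameter and the CHAR_BIT check are reconstructed from the comments above, not copied from the godbolt link:

#include <type_traits>
#include <bit>
#include <climits>

// sketch of toUint generalized over the character type, per the comments:
// signed inputs are first reinterpreted as their unsigned counterpart via
// std::bit_cast (C++20) so no sign extension can occur, then masked
template<typename IntType, typename CharType>
IntType toUint( CharType byte ) {
    static_assert( std::is_integral_v<IntType> && std::is_unsigned_v<IntType>,
                   "IntType must be an unsigned integral" );
    static_assert( std::is_integral_v<CharType>,
                   "CharType must be an integral character type" );
    static_assert( sizeof(CharType) * CHAR_BIT == 8,
                   "CharType must be a single 8-bit byte" );
    if constexpr ( std::is_unsigned_v<CharType> ) {
        return static_cast<IntType>( byte ) & 0xFF;
    } else {
        return static_cast<IntType>(
            std::bit_cast<std::make_unsigned_t<CharType>>( byte ) ) & 0xFF;
    }
}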
