Undefined but correct behavior in C++

Question

The program below makes assumptions about the packing of bit-fields, type punning and object representation. In other words, it makes no pretense at portability. Still, it has the advantage of being fast. Can this program be said to be correct with the limitation of requiring a supported compiler and a platform with the right endiness?

#include <cassert>
#include <cstdint>

union Data {
    uint8_t raw[2];
    struct __attribute__((packed)) {
        unsigned int field1: 3, field2: 3, field3: 1, field4: 2;
        unsigned int field5: 7;
    } interpreted;
};


int main() {
    static_assert(sizeof(Data) == 2, "Struct size incorrect");
    static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "Only little-endian platform support is currently implemented");
    static_assert(
        #if defined(__GNUC__) || defined(__clang__)
                    true,
        #else
                    false,
        #endif
            "Compiler is neither GCC nor Clang"
    );
    Data d{.raw{0x69, 0x01}};
    /**
     *
     * 0x69  = 0b0110 1001, 0x01 = 0b0000 0001
     * On a little endian platform, each byte will be laid out in memory in reverse order, that is
     * [1001 0110, 1000 0000].
     * The GCC and clang compilers lay out the struct members in the order they are defined, and each of the values will be interpreted in the reverse order (little-endian), so
     * field1: 100 = 0b001
     * field2: 101 = 0b101
     * field3: 1   = 0b1
     * field4: 01  = 0b10
     * field5: 0000000 = 0
     *
     * Therefore, the following assertions will hold if the preceding assertions were satisfied.
     */
    assert(d.interpreted.field1 == 1);
    assert(d.interpreted.field2 == 5);
    assert(d.interpreted.field3 == 1);
    assert(d.interpreted.field4 == 2);
    assert(d.interpreted.field5 == 0);
}

Uhh. Just store a single `uint16_t` and use `>>` and `&` to extract parts from it. — HolyBlackCat, Apr 28 '23 at 19:57
"_within the limitations of using a supported compiler_": It is not clear what you mean here. The program has undefined behavior (regardless of any choice of implementation-defined behavior) per the standard if `NDEBUG` isn't set. But regardless a compiler such as GCC will allow this program, and assuming the layout is correct, will guarantee expected behavior. "Undefined behavior" just sets the bound of what the compiler doesn't _have_ to support in any particular way to conform with the standard. — user17732522, Apr 28 '23 at 19:58
@user17732522 sorry, tried to express it better. So you think this program will work as long as it is compiled with GCC on a little-endian platform? — user23952, Apr 28 '23 at 20:00
Maybe as long as the behavior of GCC never changes, but it might suddenly stop working as intended with a GCC update. Relying on UB is dangerous for long-term maintenance of an app. It's much safer to use defined behavior with & and >> like @HolyBlackCat suggested — Dave S, Apr 28 '23 at 20:04
@user23952 I don't know for sure. In particular I don't know by heart what the effect of `__attribute__((packed))` on bit fields is. I would have to look it up myself. — user17732522, Apr 28 '23 at 20:05
@DaveS GCC supports the union type punning explicitly. It is somewhere in the documentation. And I am pretty sure the bit field layout is also guaranteed somewhere in the documentation (and that is supposed to be documented anyway since it is _implementation-defined_). — user17732522, Apr 28 '23 at 20:06
@user17732522 - then it sounds like this is not undefined behavior for GCC. If so, the main concern would be if the code ever needs to run when built by another compiler. Some of our old code at work contains Microsoft-specific C++ extensions and functions, and is not undefined behavior as long as we stick with Visual Studio.. — Dave S, Apr 28 '23 at 20:20
Not sure what the "being fast" link is supposed to demonstrate other than this program has no side effects and is optimized to nothing. — Retired Ninja, Apr 28 '23 at 20:22
@DaveS Which would mean that as long as the program has a `static_assert(__GNUC__)`, perhaps with a version range, then it is a correct program, isn't it? With a reference to the relevant part of the GCC documentation, I think it could be justifiably put in production if it offered some critical advantage? The approach by @holyblackcat is of course what I would normally use. — user23952, Apr 28 '23 at 20:27
That's my take. Our code is only for win32 so there's no need for us to worry about the dependency on the VS compiler. Your asserts should catch being moved to an incompatible compiler or CPU architecture. However, as @RetiredNinja notes the compiler can optimize your test code down to nothing so you need to test with data generated at runtime (and a lot of it) to get accurate timing comparisons against & and >> — Dave S, Apr 28 '23 at 20:31
What kind of advantage would this offer? It'll most likely be optimized to the same thing as manual bit manipulation. — HolyBlackCat, Apr 28 '23 at 20:35
The intention is to use this program for large byte strings and of course the final program will have side effects. — user23952, Apr 28 '23 at 20:37
@HolyBlackCat I have a data field of 10 bytes and about 30 fields. They don't follow the character or word boundaries. It is tempting to list the fields and their sized and let the compiler worry about how many bits to shift and so on. — user23952, Apr 28 '23 at 20:43
Undefined behaviour is when *the C++ standard* doesn't guarantee anything about your program. There is nothing wrong with looking for such guarantees elsewhere. — n. m. could be an AI, Apr 28 '23 at 20:47
Then *"has the advantage of being fast"* is somewhat misleading, if the reason you're doing this is convenience, rather than speed. My suggestion would be to write some clever template to automate the bitshifts, then you won't have to worry about portability. — HolyBlackCat, Apr 28 '23 at 20:51
Type punning is supported by GCC: https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc/Optimize-Options.html#Type-punning — user23952, Apr 28 '23 at 20:59
@HolyBlackCat it is not misleading, this code is very fast. Your suggested implementation, while also fast if implemented correctly, requires care to be taken not to extract data that is not used. — user23952, Apr 28 '23 at 21:04
Why use `#if` in combination with `static_assert` when `#if !defined(__GNUC__) && !defined(__clang__) #error Compiler is neither GCC nor Clang #endif` would suffice? Also you could check the endianness compiler independent as `static_assert(std::endian::native == std::endian::little);` (C++20 required). — fabian, Apr 28 '23 at 21:21
Consider for a sample usage: `Data d; read_data(d.raw); std::cout << d.interpreted.field4;` this wont spend time reading and copying any other data fields — user23952, Apr 28 '23 at 21:23
Not all developers do it (and some companies enforce mandatory policies forbidding it) but there is no absolute rule preventing use of compiler-specific (or system-specific) features that are not part of standard C++ (e.g. involve some incantation of undefined behaviour, require particular implementation-defined features that not all compilers/libraries support). It is *strongly advisable*, if doing so, to *both* document that fact and test (e.g. detect compiler (or compiler version) using predefined macros, and terminate compilation with an error if required features are not supported). — Peter, Apr 29 '23 at 06:46
Agree with @Peter. This code is correct, but its correctness depends not only on the C++ standard, but also on the implementation-defined features including the application-binary interface (ABI). To introduce code like this in production, those dependencies need to be understood and maintained. They should be both documented and verified during compilation. This is complicated, and usually not worth it. — user23952, Apr 29 '23 at 12:12
*"with the limitation of requiring a supported compiler"* -- what would make a compiler "supported"? Is it one you've tried this code on and gotten the result you wanted, or is it one that has documented that (as an extension to the C++ standard) it will compile your code to give you the result you want? *The latter might be better expressed as "with the limitation of requiring a compiler that supports this code" -- reversing the direction of "support".* — JaMiT, Apr 29 '23 at 16:11
*"Can this program be said to be correct"* -- how are you defining "correct"? You've acknowledged that it is not correct C++ code, so I guess you mean something more like "correct in some compiler's implementation/extension of C++"? — JaMiT, Apr 29 '23 at 16:14
@JaMiT per for example https://en.wikipedia.org/wiki/Correctness_(computer_science), a program is correct if it behaves correctly. This question can therefore be understood as a request to specify under which circumstances the above program is correct. From the discussion above, it looks like this program is correct when compiled with at least some very common compilers, and at least some ABIs. — user23952, Apr 29 '23 at 17:39
@user23952 *"a program is correct if it behaves correctly"* -- then perhaps the question should specify the correct behavior? And even then, your question seems to be "can this program be said to behave correctly if we restrict it to the circumstances in which it behaves correctly?" Unless you were wondering if there were other factors to consider besides compiler and endianness? That could be a valid question, but I don't get it from the current phrasing. Hmm... I'm guessing that another paragraph or two might make the question clearer. — JaMiT, Apr 29 '23 at 17:48
@user23952 *"a program is correct if it behaves correctly"* -- since you are interested in behavior/algorithm, rather than in what specifications say is correct, you may want to drop "in C++" from your question title. And maybe add this definition to your question, to make it clearer what you mean. — JaMiT, Apr 29 '23 at 17:53
This is clear from `static_assert( #if defined(__GNUC__) || defined(__clang__)`. — user23952, Apr 29 '23 at 18:01

score 0 · Answer 1 · answered Apr 28 '23 at 22:31

Your problem made me consider how this could be expressed using the Ada programming language. The following example works as you want yours to work.

with Ada.Text_IO; use Ada.Text_IO;

procedure Main is
   type bit_3 is range 0 .. 2**3 - 1 with
      Size => 3;
   type bit_1 is range 0 .. 1 with
      Size => 1;
   type bit_2 is range 0 .. 3 with
      Size => 2;
   type bit_7 is range 0 .. 2**7 - 1 with
      Size => 7;

   type bit_fields is record
      field1 : bit_3;
      field2 : bit_3;
      field3 : bit_1;
      field4 : bit_2;
      field5 : bit_7;
   end record;

   for bit_fields use record
      field1 at 0 range 0 ..  2;
      field2 at 0 range 3 ..  5;
      field3 at 0 range 6 ..  6;
      field4 at 0 range 7 ..  8;
      field5 at 0 range 9 .. 15;
   end record;

   type unint_8 is range 0 .. 2**8 - 1 with
      Size => 8;

   package unint_io is new Integer_IO (unint_8);
   use unint_io;
   type two_bytes is array (Positive range 1 .. 2) of unint_8;

   foo : two_bytes;
   bar : bit_fields with
      Address => foo'Address;

begin
   foo := (16#69#, 1);
   for value of foo loop
      Put (Item => value, Base => 2);
      Put (" ");
   end loop;
   New_Line;

   Put_Line ("Field1 :" & bit_3'Image (bar.field1));
   Put_Line ("Field2 :" & bit_3'Image (bar.field2));
   Put_Line ("Field3 :" & bit_1'Image (bar.field3));
   Put_Line ("field4 :" & bit_2'Image (bar.field4));
   Put_Line ("field5 :" & bit_7'Image (bar.field5));
end Main;

I started by defining data types for each of the field types you need in your bit fields. Thus, type bit_3 is defined as a 3-bit integer, bit_1 is defined as a 1-bit integer, bit_2 is defined as a 2-bit integer and bit_7 is defined as a 7-bit integer.

I then created a record type named bit_fields specifying the fields as you have specified them in your example. The record representation clause following the record definition clause specifies the precise bit layout of the record, with bit numbers starting from the beginning of byte 0.

I then created a 1-byte integer type named unint_8 with a size of 8 bits. I instantiated a generic I/O package for unint_8 so that I can output values of that type in binary. I created an array type named two_bytes containing two unint_8 elements.

Ada does not have unions as such. I have achieved the same effect by creating an instance of two_bytes named foo followed by an instance of bit_fields named bar, with the address of bar set to the same address as the variable foo.

I assigned the values hex 69 and 1 to the two elements of foo and printed foo in binary followed by printing each field of bar in decimal.

The output of this program is:

2#1101001# 2#1# 
Field1 : 1
Field2 : 5
Field3 : 1
field4 : 2
field5 : 0

This does not answer the question that was asked. The question is about characterizing the program that is in the question. It is not asking for alternative ways to accomplish the same functionality (which might be a follow-up question, but is not this question). — JaMiT, Apr 29 '23 at 16:17
The discussions eventually headed to a question of correctness of the program. I produced a program in the Ada language which is fully correct and produces the results the author asserted in his or her C++ example. My answer therefore addressed whether or not the results could be considered correct, even if C++ cannot be relied upon to produce the correct results in all architectures. — Jim Rogers, Apr 29 '23 at 17:26
Note that results are not the same as the program. You address *"whether or not the results could be considered correct"*, while the question asks *"Can this program be said to be correct"*. Not the same thing. I still maintain that this does not answer the question that was asked. — JaMiT, Apr 29 '23 at 17:40
@JaMiT I find it interesting that you act as a source of authority while providing no answer to the question yourself. I suggest you provide a useful answer for user23952. — Jim Rogers, Apr 29 '23 at 22:53
@JimRogers this was not quite the answer I was looking for, but thanks for providing an interesting perspective. — user23952, Apr 30 '23 at 19:24
The primary difference between your version and the Ada version is the ability in Ada to specify not only the bit size, as is done in a C or C++ bit field, but also to specify the bit position within the block of memory for every field, even when the field crosses byte boundaries. Once the position is specified the compiler must comply as long as all the fields will fit in the memory reserved for the record. If they do not fit the compiler rejects the record definition. — Jim Rogers, Apr 30 '23 at 21:26
@JimRogers *"I find it interesting that you act as a source of authority while providing no answer to the question yourself."* -- No big mystery here. From [answer]: *"Not all questions can or should be answered here."* I believe this question, in its current form, falls into the category of questions that should not be answered. So I have not answered it. *"I suggest you provide a useful answer for user23952."* -- Thank you for your suggestion, but no, I decline. Not until the question is improved. — JaMiT, May 06 '23 at 02:59

Undefined but correct behavior in C++

1 Answers1