6

The C++ reference has the following explanation for unions, with the interesting part for this question in bold:

The union is only as big as necessary to hold its largest data member. The other data members are allocated in the same bytes as part of that largest member. The details of that allocation are implementation-defined, and **it's undefined behavior to read from the member of the union that wasn't most recently written**. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union.

Now, if I compile the following code on Linux Mint 18 with g++ -std=c++11, I get the output given by the comments next to the printf statements:

#include <cstdio>
using namespace std;

union myUnion {
    int var1; // 32 bits
    long int var2; // 64 bits
    char var3; // 8 bits
}; // union size is 64 bits (size of largest member)

int main()
{
    myUnion a;
    a.var1 = 10;
    printf("a is %ld bits and has value %d\n",sizeof(a)*8,a.var1); // ...has value 10
    a.var2 = 123456789;
    printf("a is %ld bits and has value %ld\n",sizeof(a)*8,a.var2); // ...has value 123456789
    a.var3 = 'y';
    printf("a is %ld bits and has value %c\n",sizeof(a)*8,a.var3); // ...has value y
    printf("a is %ld bits and has value %ld\n",sizeof(a)*8,a.var2); //... has value 123456789, why???
    return 0;
}

On the line before return 0, reading a.var2 does not give the ASCII value of the 'y' character (which is what I expected; I'm new to unions) but the value it was originally assigned. Based on the above quote from cppreference.com, am I to understand that this behaviour is undefined in the sense that it is not specified by the standard, but is instead GCC's particular implementation?

EDIT

As pointed out by the great answers below, I made a copying mistake in the comment after the printf statement just before return 0. The correct version is:

 printf("a is %ld bits and has value %ld\n",sizeof(a)*8,a.var2); //... has value 123456889, why???

i.e. the 7 changes to an 8, because the first byte in memory (the least significant byte on this little-endian machine, 0x15 = 21) is overwritten with the ASCII value of the 'y' character, i.e. 121 (0111 1001 in binary), which increases the value by exactly 100. I'll leave the code above as it is, though, to stay coherent with the great discussion that resulted from it.
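In hex (a sketch of the arithmetic, assuming the usual x86-64 layout where the first byte in memory is the least significant one):

123456789  = 0x075BCD15    // original value of a.var2
'y'        = 121 = 0x79    // overwrites the lowest byte, 0x15 (= 21)
0x075BCD79 = 123456889     // = 123456789 + (121 - 21) = 123456789 + 100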

TylerH
space_voyager
  • Correct, GCC decided that this would be a useful feature, but compilers are not required to do that - and they could in fact crash or just give you nasal demons. – UKMonkey Oct 18 '16 at 16:53
  • The behavior is undefined. That means **only** that the C++ language definition doesn't tell you what that code does. – Pete Becker Oct 18 '16 at 16:53
  • As a side note if your union is a standard layout POD, you can push the "conversion" into a function in a separate translation unit, and you build + link that as pure C: you'd be in the realm of defined behavior, since the C standard is perfectly fine with type punning this way. – StoryTeller - Unslander Monica Oct 18 '16 at 17:03
  • No, on the last line, the output is `123456889` because it's the hex pattern `79cd5b0700000000` instead of `15cd5b0700000000` (`123456789`) - the `y` turned a `0x15` into `0x79`, which reflects the difference in the output. That said, undefined behaviour is undefined. – Anya Shenanigans Oct 18 '16 at 17:05
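A minimal sketch of the approach StoryTeller describes above (not from the original thread; the file name pun.c and the function name read_long_after_char are made up for illustration). The translation unit below is built and linked as C, where reading a union member other than the one last written is sanctioned type punning (see the C11 quote in Daniel's answer below); from C++ you would declare the function extern "C":

/* pun.c -- build and link this file as C */
union pun {
    long int as_long;
    char     as_char;
};

/* Write the char member, then read the long member back.
   In C this reinterprets the object representation (type punning). */
long int read_long_after_char(long int initial, char c)
{
    union pun u;
    u.as_long = initial;
    u.as_char = c;        /* overwrites only the first byte of the storage */
    return u.as_long;     /* well-defined in C; UB in C++ */
}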

4 Answers

4

The fun thing about undefined behavior is that it's very specifically not the same as "random" behavior. Compilers will have a behavior that they decide to use when dealing with undefined behavior, and tend to exhibit the same behavior every time.

Case in point: IDEOne has its own interpretation of this code: http://ideone.com/HO5id6

a is 32 bits and has value 10
a is 32 bits and has value 123456789
a is 32 bits and has value y
a is 32 bits and has value 123456889

You might notice something kind of funny happened there (setting aside the fact that for IDEOne's compiler, long int is 32 bits and not 64 bits). It still shows line 4 as reading similarly to line 2, but the value has actually changed slightly. What appears to have happened is that the char value of 'y' was set in the union, but it didn't alter any of the other bits. I got similar behavior when I switched it to long long int instead of long int.

You may want to check if, in your example, line 4 is exactly the same as it was before. I'm a little skeptical that that's actually the case.

At any rate, to answer your specific question, the TL;DR is that in GCC, writing to a union only alters the bits associated with the specific member you're writing to, and it's not guaranteed to alter/clear all the other bits. And of course, like anything UB-related, make no assumptions that any other compiler (or even later versions of the same compiler!) will behave the same.
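Not part of the answer above, but for completeness, a minimal C++ sketch of how to get the same partial overwrite without relying on union type punning at all: copy the byte in with std::memcpy, which is well-defined (the printed value still depends on the machine being little-endian):

#include <cstdio>
#include <cstring>

int main()
{
    long int x = 123456789;        // 0x075BCD15
    char y = 'y';                  // 0x79

    std::memcpy(&x, &y, sizeof y); // overwrite only the first byte of x

    std::printf("%ld\n", x);       // prints 123456889 on little-endian x86-64
    return 0;
}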

Xirema
  • Indeed, I missed the slight change! I get the same output as you :) – space_voyager Oct 18 '16 at 17:38
  • _"Compilers will have a behavior that they decide to use when dealing with undefined behavior, and tend to exhibit the same behavior every time."_ Not necessarily. – Jonathan Wakely Oct 18 '16 at 17:39
  • @JonathanWakely Tendency is not a hard constraint ;) – space_voyager Oct 18 '16 at 17:46
  • I think that the standard uses 'implementation defined' for things which one can expect a given compiler to be consistent about. – jbcoe Oct 18 '16 at 18:39
  • The chief problem is that compilers typically do not have a "specific" behavior for Undefined Behavior. They simply do not think about it. If your code causes Undefined Behavior, it usually means you hit cases not considered by your compiler writer. – MSalters Oct 19 '16 at 07:29
  • @jbcoe: _Unspecified_ behavior is also fairly consistent; implementation-defined is still specified but unportable behavior – MSalters Oct 19 '16 at 07:30
3

You're printing just parts of the same region of memory:

myUnion a;
a.var2 = -1;
printf("a is %ld bits and has value %ld = 0x%lx\n",
    sizeof(a)*8, a.var2, a.var2);
a.var3 = 'y';
printf("a is %ld bits and has value %c = 0x%x\n",
    sizeof(a)*8, a.var3, a.var3);
printf("a is %ld bits and has value %ld = 0x%lx\n",
    sizeof(a)*8, a.var2, a.var2);

Sample output

a is 64 bits and has value -1 = 0xffffffffffffffff
a is 64 bits and has value y = 0x79
a is 64 bits and has value -135 = 0xffffffffffffff79

I've replaced your 123456789 with -1 (all bytes set to 0xff) just for the sake of clarity. The same applies to your number as well:

a is 64 bits and has value 123456789 = 0x75bcd15
a is 64 bits and has value y = 0x79
a is 64 bits and has value 123456889 = 0x75bcd79

Again, the first byte of the original value (specifically, 0x15) is replaced with 0x79 (the 'y' character), so the original number is modified.

In effect, reading a.var2 reinterprets the whole region of memory as a long int, while reading a.var3 reinterprets just the first byte of the union's memory as a char.

Visualization:

           long int (64 bits)   = a.var2
           ****************************************************************
           int (32 bits)        = a.var1
           ********************************
           char (8 bits)        = a.var3
           ********
Bit no.:   0 ........................................................... 63
           ^
          ('y' = 0x79) (0xcd) (0x5b) (0x07)

What the quoted wording actually means is that the last assignment to a union member determines the union's value; the rest of the memory is to be treated as garbage. In practice, though, we can usually still observe the leftovers of earlier writes in the memory allocated for the union as a whole.
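Not part of the original answer: a small sketch that dumps the union's storage byte by byte (inspecting an object's bytes through unsigned char* is allowed), which in practice reproduces the layout shown in the visualization above on a little-endian machine:

#include <cstdio>
#include <cstddef>

union myUnion {
    int var1;
    long int var2;
    char var3;
};

int main()
{
    myUnion a;
    a.var2 = 123456789;   // bytes in memory: 15 cd 5b 07 00 00 00 00
    a.var3 = 'y';         // overwrites byte 0: 79 cd 5b 07 00 00 00 00

    const unsigned char *p = reinterpret_cast<const unsigned char *>(&a);
    for (std::size_t i = 0; i < sizeof a; ++i)
        std::printf("%02x ", p[i]);
    std::printf("\n");    // typically prints: 79 cd 5b 07 00 00 00 00
    return 0;
}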

Ruslan Osmanov
2

For what it is worth, the C11 standard §6.5.2.3, note 95 (page 83) says:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.

Which is what I am seeing, even when compiled as C++11 (with Apple LLVM version 8.0.0 (clang-800.0.38)):

a is 64 bits and has value 10
a is 64 bits and has value 123456789
a is 64 bits and has value y
a is 64 bits and has value 123456889

Note that the last value is not 123456789, but 123456889 as the least significant byte was overwritten by

a.var3 = 'y';

Which replaced 0x15 with 0x79 (== 'y').

Daniel
2

Are you sure you actually get the output you wrote in your comments?

On 64-bit Ubuntu with GCC 5.4.0 I get:

a is 64 bits and has value 10
a is 64 bits and has value 123456789
a is 64 bits and has value y
a is 64 bits and has value 123456889

var2 is 64 bits in size, and by changing var3 you are modifying the least significant byte of var2. It's clearer when you print using %x:

a is 64 bits and var1 has value a
a is 64 bits and var2 has value 75bcd15
a is 64 bits and var3 has value 79
a is 64 bits and var2 has value 75bcd79

var1, var2 and var3 all start at the same memory address, and since your architecture is little-endian (as on most Intel/AMD computers), modifying var3 changes the least significant byte of var2 and of var1, because they share that address.
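The hex output above was presumably produced by something like the following (the exact code is not shown in the original answer; this is a reconstruction using the same union as in the question):

#include <cstdio>
using namespace std;

union myUnion {
    int var1;
    long int var2;
    char var3;
};

int main()
{
    myUnion a;
    a.var1 = 10;
    printf("a is %zu bits and var1 has value %x\n",  sizeof(a)*8, a.var1);
    a.var2 = 123456789;
    printf("a is %zu bits and var2 has value %lx\n", sizeof(a)*8, a.var2);
    a.var3 = 'y';
    printf("a is %zu bits and var3 has value %x\n",  sizeof(a)*8, a.var3);
    printf("a is %zu bits and var2 has value %lx\n", sizeof(a)*8, a.var2);
    return 0;
}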

Guillaume Racicot