8

This is what I offered at an interview today.

int is_little_endian(void)
{
    union {
        long l;
        char c;
    } u;

    u.l = 1;

    return u.c == 1;
}

My interviewer insisted that c and l are not guaranteed to begin at the same address and therefore, the union should be changed to say char c[sizeof(long)] and the return value should be changed to u.c[0] == 1.
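For comparison, here is a minimal sketch of both variants side by side. The function names `is_little_endian_scalar` and `is_little_endian_array` are made up for illustration; the second one follows the interviewer's suggestion as I understood it.

```c
/* Original version: the char member is a single char. */
int is_little_endian_scalar(void)
{
    union {
        long l;
        char c;
    } u;

    u.l = 1;

    return u.c == 1;
}

/* Interviewer's variant (as described above): the char member is an
   array covering every byte of the long, and element 0 is inspected. */
int is_little_endian_array(void)
{
    union {
        long l;
        char c[sizeof(long)];
    } u;

    u.l = 1;

    return u.c[0] == 1;
}
```

If both members really do start at the same address, the two functions should always return the same value.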

Is it correct that members of a union might not begin at the same address?

sigjuice

8 Answers

8

I was unsure about the members of the union, but SO came to the rescue.

The check can be better written as:

int is_bigendian(void) {
    const int i = 1;
    return (*(unsigned char*)&i) == 0;
}

Incidentally, the C FAQ shows both methods: How can I determine whether a machine's byte order is big-endian or little-endian?
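A variant that avoids both the union and the pointer cast is to copy the first byte out with `memcpy`. This is only a sketch; the name `is_bigendian_memcpy` is hypothetical, and like the version above it assumes `int` is wider than one byte.

```c
#include <string.h>

/* Copy the lowest-addressed byte of an int into an unsigned char
   and check whether it holds the least significant part of the value. */
int is_bigendian_memcpy(void)
{
    const int i = 1;
    unsigned char first;

    memcpy(&first, &i, 1);

    /* On a big-endian layout the lowest-addressed byte of 1 is 0. */
    return first == 0;
}
```

It should give the same answer as the cast-based check, since both read the lowest-addressed byte of the same object.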

Community
Sinan Ünür
  • I believe the hairy pointer casting is technically undefined behavior, but I couldn't cite anything, and it should certainly work on most machines. – Chris Lutz Aug 20 '09 at 02:53
  • 2
    I'd be surprised if it were undefined; otherwise how would memcpy and most serialization code work? – Crashworks Aug 20 '09 at 03:05
  • 2
    @Chris I believe you have it reversed. Converting from a `char *` to `int *` can cause undefined behavior. I have a copy of the WG14/N1124 draft and if things haven't changed since then: *When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object.* (p.47, http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf) – Sinan Ünür Aug 20 '09 at 03:08
  • Okay. I don't have a copy (I'll get around to it one day) but I remembered hearing that the same trick from `float` to `int` in the Quake inverse square root function was undefined. I suppose converting between `char`s and `int`s is much more predictable, and thus defined. – Chris Lutz Aug 20 '09 at 03:20
  • @Chris clarification: Converting from a `char *` to `int *` would be undefined behavior if the two have different alignment requirements. But converting from any pointer type to `char *` is safe. – Sinan Ünür Aug 20 '09 at 03:28
  • 4
    @Chris: char is actually a special case in the standard, as a way of accessing the underlying representation of the other types. – caf Aug 20 '09 at 05:37
  • 2
    @Chris: "Hairy pointer casts", aka raw memory reinterpretation, are generally UB, *except* if you reinterpret it as an array of characters. The latter is explicitly allowed in C. However, when `char` is used (as opposed to `unsigned char`) the set of things you can do with reinterpreted memory is limited. The above code is generally UB, since it is UB to read the value through such a `char *` pointer - the value might be a trap representation. The proper code should have used a cast to `unsigned char*`. – AnT stands with Russia Oct 21 '09 at 16:53
  • 1
    @caf: That would be `unsigned char`, not `char`. – AnT stands with Russia Oct 21 '09 at 17:02
6

You are correct: the members of a union are guaranteed to begin at the same address. The relevant part of the Standard is (6.7.2.1 para 13):

The size of a union is sufficient to contain the largest of its members. The value of at most one of the members can be stored in a union object at any time. A pointer to a union object, suitably converted, points to each of its members (or if a member is a bit-field, then to the unit in which it resides), and vice versa.

Basically, the start address of the union is guaranteed to be the same as the start address of each of its members. I believe (still looking for the reference) that a long is guaranteed to be at least as large as a char; the two could only have the same size on an implementation where CHAR_BIT is 32 or more. If you assume that a long is larger than a char, then your solution should* be valid.

* I'm still a little uncertain due to some interesting wording around the representation of integer and, in particular, signed integer types. Take a close read of 6.2.6.2 clauses 1 & 2.
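The address guarantee quoted above can be checked directly. This is just a sketch with a made-up function name (`union_members_share_address`); it compares the address of the union object with the address of each member.

```c
/* Returns 1 if the union object and both of its members share one address. */
int union_members_share_address(void)
{
    union {
        long l;
        char c;
    } u;

    /* Only addresses are taken here; no member is read. */
    return (void *)&u == (void *)&u.l
        && (void *)&u == (void *)&u.c;
}
```

Going by the quoted paragraph, this should return 1 on any conforming implementation.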

D.Shawley
3

While your code would probably work with many compilers, the interviewer is right -- how to align fields in a union or struct is entirely up to the compiler, and in this case the char could be placed either at the "beginning" or the "end". The interviewer's code leaves no room for doubt and is guaranteed to work.

Kristoffon
1

The standard says the offsets for each item in a union are implementation defined.

When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values. (ISO/IEC 9899:1999, Representations of types, 6.2.6.1, para 7)

Therefore it's up to the compiler to choose where to put the char relative to the long within the union; they are not guaranteed to have the same address.

Dana the Sane
fbrereto
  • 4
    There is one exception here. A little further down (6.7.2.1 para 13): "The size of a union is sufficient to contain the largest of its members. The value of at most one of the members can be stored in a union object at any time. _A pointer to a union object, suitably converted, points to each of its members_ (or if a member is a bit-field, then to the unit in which it resides), and vice versa." Basically, a start address of the union is guaranteed to be the same as the start address of each of its members. – D.Shawley Aug 20 '09 at 03:10
  • Good point, I'll cease meddling with fbrereto's question. I am confused now though, because if you're right, then the code in the question should work. – Dana the Sane Aug 20 '09 at 03:24
  • The OP's code is fine: See http://stackoverflow.com/questions/891471/union-element-alignment – Sinan Ünür Aug 20 '09 at 03:42
  • I'm pretty sure that it will work and is guaranteed to do so. See my answer... I was sorta surprised by this one. – D.Shawley Aug 20 '09 at 03:42
0

I have a question about this...

how is

u.c[0] == anything

valid given:

union {
    long l;
    char c;
} u;

How does [0] work on a char?

Seems to me, it would be equivalent to: *(u.c + 0) == anything, which would be, well, crap, considering the value of u.c, treated as a pointer, would be crap.

(Unless perhaps, as it occurs to me now, some HTML crap code ate an ampersand in the original question...)

smcameron
0

While the interviewer is correct and this is not guaranteed to work by the spec, none of the other answers are guaranteed to work either, as dereferencing a pointer after casting it to another type yields undefined behavior.

In practice, this (and the other answers) will always work, as all compilers allow casting between pointer-to-union and pointer-to-member-of-union transparently -- much ancient code would fail to work if they did not.

Sinan Ünür
Chris Dodd
  • Neither clang nor gcc will reliably handle any accesses to non-character-type union members which involve taking the address and dereferencing them, unless the access takes the form of an array-element access using bracketed subscript notation. Even a statement like `*(myUnion.intArray+i) = 23;` will not be recognized as potentially affecting the value of `*(myUnion.floatArray+j)`. – supercat Sep 14 '22 at 20:31
0

Correct me if I am wrong, but local variables are not initialized to 0.

Is this not better:

union {
    long l;
    char c;
} u={0,};
Mandrake
0

A point not yet mentioned is that the standard explicitly allows for the possibility that integer representations may contain padding bits. Personally I wish the standards committee would allow a nice easy way for a program to specify certain expected behaviors, and require that any compiler must either honor such specifications or refuse compilation; code which started with an "integers must not have padding bits" specification would then be entitled to assume that to be the case.

As it is, it would be perfectly legitimate (albeit odd) for an implementation to store 35-bit long values as four 9-bit characters in big-endian format, but use the LSB of the first byte as a parity bit. Under such an implementation, storing 1 into a long could cause the parity of the overall word to become odd, thus compelling the implementation to store a 1 into the parity bit.

To be sure, such behavior would be odd, but if architectures that use padding are sufficiently notable to justify explicit provisions in the standard, code which would break on such architectures can't really be considered truly "portable".

The code using union should work correctly on all architectures which can be simply described as "big-endian" or "little-endian" and do not use padding bits. It would be meaningless on some other architectures (and indeed the terms "big-endian" and "little-endian" could be meaningless too).
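One way to at least detect padding bits is to count the value bits of the unsigned type and compare that with the size of its storage. A rough sketch follows; the function names are hypothetical, and it only looks at `unsigned long` (a signed `long` has a sign bit and could in principle have a different number of padding bits).

```c
#include <limits.h>

/* Count the value bits of unsigned long by shifting ULONG_MAX down to zero. */
int value_bits_ulong(void)
{
    unsigned long max = ULONG_MAX;
    int bits = 0;

    while (max != 0) {
        bits++;
        max >>= 1;
    }
    return bits;
}

/* Returns 1 if every bit of an unsigned long's storage is a value bit,
   i.e. the type has no padding bits. */
int ulong_has_no_padding_bits(void)
{
    return value_bits_ulong() == (int)(sizeof(unsigned long) * CHAR_BIT);
}
```

A program could refuse to run its endianness test when `ulong_has_no_padding_bits()` returns 0, since the result would not mean much there.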

supercat