
The majority of microcomputer C compilers have two signed integer types with the same size and representation, along with two such unsigned types. If int is 16 bits, its representation will generally match short; if long is 64 bits, it will generally match long long; otherwise, int and long will usually have matching 32-bit representations.

If, on a platform where long, long long, and int64_t have the same representation, one needs to pass a buffer to three API functions in order (assume the APIs are under the control of someone else and use the indicated types; if the functions could readily be changed, they could simply be changed to use the same type throughout),

void fill_array(long *dat, int size);
void munge_array(int64_t *dat, int size);
void output_array(long long *dat, int size);

is there any efficient standard-compliant way of allowing all three functions to use the same buffer without requiring that all of the data be copied between function calls? I doubt the authors of C's aliasing rules intended that such a thing should be difficult, but it is fashionable for "modern" compilers to assume that nothing written via long* will be read via long long*, even when those types have the same representation. Further, while int64_t will generally be the same as either long or long long, implementations are inconsistent as to which.

On compilers that don't aggressively pursue type-based aliasing through function calls, one could simply cast pointers to the proper types, perhaps including a static assertion to ensure that all types have the same size. The problem is that if a compiler like gcc, after expanding out function calls, sees that some storage is written as long and later read as long, without any intervening writes of type long, it may replace the later read with the value written as type long, even if there were intervening writes of type long long.

Disabling type-based aliasing altogether is of course one approach to making such code work. Any decent compiler should allow that, and it will avoid many other possible pitfalls. Still, it seems like there should be a Standard-defined way to perform such a task efficiently. Is there?

supercat
  • Perhaps `long long datll[size]; fill_array(MY_LLP_LP(datll), size);` and let the macro check/handle the conversions? – chux - Reinstate Monica Sep 08 '16 at 17:47
  • @chux: What would you be expecting the macro to do? Bear in mind that zero machine instructions should be necessary to perform the conversions--the only requirement is that it prevent the compiler from "optimizing" later functions to not see the data written by earlier ones. – supercat Sep 08 '16 at 17:50
  • How common is it to pass a buffer as a pointer to some integer type? Isn't `char*` or `void*` much more common? What library are you encountering with this profile? Also, wouldn't it be trivial to wrap the functions with a standard type? (ie: thunking) – ebyrob Sep 08 '16 at 18:02
  • Goal is not conversion of an integer type to another integer, but conversion of an integer pointer to another pointer. Suggest a macro to do that with a pointer cast. Use `#if` processing and `_Static_assert` as able to insure the simple cast will suffice. – chux - Reinstate Monica Sep 08 '16 at 18:06
  • @ebyrob, it's not unheard of to pass around buffers of elements of declared type other than character types. It's less common in code that aims to be highly portable, but even in codes such as those, buffers of explicit-width types such as `int64_t` are sometimes seen. It all depends on what you're trying to represent. – John Bollinger Sep 08 '16 at 18:52
  • @chux: Goal is to make integers which were written using one type readable with another pointer type that has the same representation. Both gcc and clang interpret pointer-aliasing rules as allowing compilers to ignore aliasing between different integer types, even when they have the same representation. – supercat Sep 08 '16 at 18:56
  • @supercat: I'm more than a little hazy on the semantics of the C aliasing rules but I don't suppose there's any chance that a dummy in-place `memmove` of the buffer followed by casting would be well-defined and optimized by "decent" compilers? – doynax Sep 08 '16 at 19:24
  • @doynax: If the destination of `memcpy` or `memmove` has no declared type, its Effective Type will become that of the source--a rule whose primary effect is to make `memcpy`/`memmove` useless for scrubbing effective types. Compilers would probably also be entitled to apply the same effective-type transference to something like `for (size_t i=0; i – supercat Sep 08 '16 at 19:54
  • @doynax: And incidentally, an attempt to use `memmove` to copy an object to itself will indeed be stripped out by gcc while leaving the Effective Type of the buffer unchanged. – supercat Sep 08 '16 at 19:56
  • @supercat: _Grumble_.. Did the standard body just decide to sit down one day and ask themselves how they might define the language to maximize the number of subtle and insidious bugs through undefined behavior, while creating the absolute minimum number of escape hatches possible and documenting the intended well-defined uses as unclearly as possible? Honestly, how hard would it have been to define some function or specifier to declare intentional type punning to the compiler and reader.. – doynax Sep 08 '16 at 20:28
  • @doynax: C89 is a decent spec if one recognizes that it does not claim to specify everything necessary to make something be a *quality* implementation for any particular platform, and recognizes that if programmers could expect quality pre-C89 implementations for a given platform to behave a certain way, they should be able to expect quality C89 implementations for that platform to do likewise except when absolutely forbidden (e.g. even if pre-C89 implementations promoted 8-bit `unsigned char` to 16-bit `unsigned int`, C89 implementations would be required to promote to `signed int`). – supercat Sep 08 '16 at 20:35

2 Answers


is there any efficient standard-compliant way of allowing all three functions to use the same buffer without requiring that all of the data be copied between function calls? I doubt the authors of C's aliasing rules intended that such a thing should be difficult, but it is fashionable for "modern" compilers to assume that nothing written via long* will be read via long long*, even when those types have the same representation.

C specifies that long and long long are different types, even if they have the same representation. Regardless of representation, they are not "compatible types" in the sense defined by the standard. Therefore, the strict aliasing rule (C2011 6.5/7) applies: an object having effective type long shall not have its stored value accessed by an lvalue of type long long, and vice versa. Consequently, whatever the effective type of your buffer is, your program exhibits undefined behavior if it accesses elements both as type long and as type long long.

Whereas I concur that the authors of the standard did not intend that what you describe should be hard, they also have no particular intention to make it easy. They are concerned above all with defining program behavior in a way that as much as possible is invariant with respect to all of the freedoms allowed to implementations, and among those freedoms is that long long can have a different representation than does long. Therefore, no program that relies on them having the same representation can be strictly conforming, regardless of the nature or context of that reliance.

Still, it seems like there should be a Standard-defined way to perform such a task efficiently. Is there?

No. The effective type of the buffer is its declared type if it has one, or otherwise is defined by the manner in which its stored value was set. In the latter case, that might change if a different value is written, but any given value has only one effective type. Whatever its effective type is, the strict aliasing rule does not allow for the value to be accessed via lvalues both of type long and of type long long. Period.
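For reference, the fully portable baseline is the copying approach that the question hopes to avoid. What follows is only a minimal sketch under assumptions: the three function bodies are hypothetical stand-ins for the real API, and `N` and `run_pipeline` are names invented for illustration. Each buffer has a declared type, so every access goes through an lvalue of the object's effective type and the strict aliasing rule is never in play.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

#define N 4  /* hypothetical element count */

/* Guard the one conversion that could in principle lose data. */
static_assert(LONG_MAX <= INT64_MAX && LONG_MIN >= INT64_MIN,
              "long must fit in int64_t");

/* Hypothetical stand-ins for the three third-party functions. */
static void fill_array(long *dat, int size) {
    for (int i = 0; i < size; i++)
        dat[i] = i + 1;               /* 1, 2, 3, ... */
}

static void munge_array(int64_t *dat, int size) {
    for (int i = 0; i < size; i++)
        dat[i] *= 10;
}

static void output_array(long long *dat, int size) {
    for (int i = 0; i < size; i++)
        printf("%lld\n", dat[i]);
}

/* Runs the three calls in order, converting element by element between
   separately declared buffers. The final values are left in out. */
void run_pipeline(long long out[N]) {
    long    as_long[N];
    int64_t as_i64[N];

    fill_array(as_long, N);
    for (int i = 0; i < N; i++)
        as_i64[i] = as_long[i];       /* value conversion, not punning */

    munge_array(as_i64, N);
    for (int i = 0; i < N; i++)
        out[i] = as_i64[i];           /* int64_t always fits in long long */

    output_array(out, N);
}
```

When the types share a representation, a compiler may reduce each conversion loop to a block copy, but the copy itself remains, which is exactly the cost the question wants to avoid.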

Disabling type-based aliasing altogether is of course one approach to making such code work. Any decent compiler should allow that, and it will avoid many other possible pitfalls.

Indeed, that or some other implementation-specific approach, possibly including It Just Works, are your only alternatives for sharing the same data among the three functions you present without copying.
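As a rough sketch of that implementation-specific route, the code below casts one buffer to each pointer type. It is not defined by the standard: it assumes the implementation documents identical representations for the three types, and that type-based aliasing is disabled (with GCC or Clang, by building with `-fno-strict-aliasing`). The function bodies, `N`, and `run_shared` are hypothetical names used only for illustration, and the size check is a necessary condition rather than proof of matching representations.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <stdio.h>

#define N 4  /* hypothetical element count */

/* Necessary, but not sufficient: equal size does not prove equal
   representation, so the implementation's documentation still matters. */
static_assert(sizeof(long) == sizeof(long long) &&
              sizeof(long long) == sizeof(int64_t),
              "long, long long and int64_t must have the same size");

/* Hypothetical stand-ins for the three third-party functions. */
static void fill_array(long *dat, int size) {
    for (int i = 0; i < size; i++)
        dat[i] = i + 1;
}

static void munge_array(int64_t *dat, int size) {
    for (int i = 0; i < size; i++)
        dat[i] *= 10;
}

static void output_array(long long *dat, int size) {
    for (int i = 0; i < size; i++)
        printf("%lld\n", dat[i]);
}

/* NOT defined by the standard: relies on matching representations and on
   the compiler not applying type-based aliasing (for example, a build
   with -fno-strict-aliasing). Returns the last element for inspection. */
long long run_shared(void) {
    long long buf[N];

    fill_array((long *)buf, N);
    munge_array((int64_t *)buf, N);
    output_array(buf, N);
    return buf[N - 1];
}
```

No data is copied here, but the guarantee comes from the compiler option and the implementation's documentation, not from the language.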

Update:

Under some restricted circumstances there may be a somewhat more standard-based solution. For example, with the specific API functions you designated, you could do something like this:

union buffer {
    long       l[BUFFER_SIZE];
    long long ll[BUFFER_SIZE];
    int64_t  i64[BUFFER_SIZE]; 
} my_buffer;

fill_array(my_buffer.l, BUFFER_SIZE);
munge_array(my_buffer.i64, BUFFER_SIZE);
output_array(my_buffer.ll, BUFFER_SIZE);

(Props to @Riley for giving me this idea, though it differs a bit from his.)

Of course that doesn't work if your API dynamically allocates the buffer itself. Note, too, that

  • A program using that approach may conform to the standard, but if it assumes the same representation for long, long long, and int64_t then it still does not strictly conform, as the standard defines that term.

  • The standard is a bit inconsistent on this point. Its remarks about allowing type punning via a union are in a footnote, and the footnotes are non-normative. The reinterpretation described in that footnote seems to contradict paragraph 6.5/7, which is normative. I prefer to keep my mission-critical code far away from uncertainties such as this, for even if we conclude that this approach should work, the uncertainty provides just the kind of cranny that compiler bugs like to lodge in.

  • A rather well-known figure in the field once had this to say about the issue:

Unions are not useful [for aliasing], regardless of what silly language lawyers say, since they are not a generic method. Unions only work for trivial and largely uninteresting cases, and it doesn't matter what C99 says about the issue, since that nasty thing called "real life" interferes.

John Bollinger
  • I do not dispute that the Standard allows for the possibility that an implementation could document different representations for `long` and `long long`, even if both types had the same size. On many implementations, however, the representations of `long` and `long long` are documented and they match. The question is whether there is any way to exchange the data without relying upon anything beyond the documented representations. – supercat Sep 08 '16 at 18:59
  • @supercat, I have answered that question. No. The rest of the answer is a discussion of what parts of the standard yield that conclusion, and of why the standard does not provide a mechanism such as you are looking for. – John Bollinger Sep 08 '16 at 19:02
  • So is the only way to make the code portable to write a silly loop which reads each word as one type and then writes the same data back as another, and hope that the compiler manages to omit the instructions which would do the loads and stores, but still reliably recognize that the effective type has changed (gcc 6.2 sometimes omits such load/store operations but fails to recognize that the effective type changes). – supercat Sep 08 '16 at 19:06
  • @supercat, if you are stuck with a combination of interfaces such as you describe, and you are willing to rely on the representations of `long` and `long long` to be the same in your chosen implementation (which presumably you can check in its documentation), then I don't see what's to be gained by avoiding further reliance on your implementation's specific features. With GCC, for instance, I'd consider just casting the pointers and turning on `-fno-strict-aliasing` if the type representations really did match. – John Bollinger Sep 08 '16 at 19:14
  • @supercat Type punning is allowed using a union. [See my answer](http://stackoverflow.com/a/39396889/6697083) – Riley Sep 08 '16 at 19:17
  • @Riley: Compilers like gcc will only recognize type punning through a union if the lvalue accesses use the union type directly. Taking the addresses of union members and then using those as pointers to the individual member types won't work. – supercat Sep 08 '16 at 19:59
  • @supercat gcc doesn't give any errors, and my (basic) test worked properly with the code in my answer (I have the functions just print out the value passed in). What else would be the problem? – Riley Sep 08 '16 at 20:05
  • @Riley: See https://godbolt.org/g/S1k9E9 for a demonstration of gcc 6.2's failure to recognize aliasing of accesses to arrays that are part of a union. – supercat Sep 08 '16 at 20:24
  • @supercat My assembly is a little rusty. What's the problem? – Riley Sep 08 '16 at 20:38
  • @Riley: The code for test3 is pretty simple: return 1 unconditionally, even though it would return 3 if the compiler recognized the aliasing between the storage at p->v1 and p->v2. – supercat Sep 08 '16 at 20:48
  • @supercat I thought it was weird that it never called blah3. Does it see `p->v1` and `p->v2` as two different things, so it can optimize away all of the calls because `p->v1` is only ever assigned `1`? – Riley Sep 08 '16 at 21:00
  • @Riley: That's precisely the problem. As far as I'm concerned, gcc's default mode implements a subset of Dennis Ritchie's language, and is unsuitable for any code that will ever need to reuse storage without going through a malloc/free cycle (as of 6.2 it's not reliable if storage gets used as `long` and then as `long long`, even if storage is never read using any type other than the one with which it was written). – supercat Sep 08 '16 at 21:18
  • @JohnBollinger: Did you see the godbolt link? Beyond the fact that such a pattern would require specifying a hard-coded maximum buffer size, gcc 6.2 doesn't recognize the references to `l`, `ll`, or `i64` as changing the active member of the union. – supercat Sep 08 '16 at 21:22
  • @JohnBollinger: It may be fair to note, with regard to the Linus Torvalds quote, that he wasn't saying unions are generally useless, but rather that it is generally not practical to encapsulate everything that might alias within an actual union object. – supercat Sep 08 '16 at 22:09
  • @supercat, with respect to gcc 6.2, then, it seems Linus was right. I have edited my answer a bit to clarify the context of the quotation. – John Bollinger Sep 09 '16 at 13:16
  • @JohnBollinger: I like your edits there. Interestingly, there are two ways "real life" intervenes: real-life data formats can often not be mapped to unions, and real compilers like gcc don't always work even when using unions (and even memmove!) – supercat Sep 09 '16 at 14:53

You can try doing it with macros. The sizeof operator is not available to the C preprocessor, but you can compare the limit macros from <limits.h>, such as UINT_MAX:

#include <limits.h>

#if UINT_MAX == USHRT_MAX
#  define INT_BUFFER ((unsigned*)short_buffer)
#elif UINT_MAX == ULONG_MAX
#  define INT_BUFFER ((unsigned*)long_buffer)
#elif UINT_MAX == ULLONG_MAX
#  define INT_BUFFER ((unsigned*)long_long_buffer)
#else /* Fallback. */
  extern unsigned int_buffer[BUFFER_SIZE];
#  define INT_BUFFER int_buffer
#endif
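Since C11, a static assertion can cover what the preprocessor cannot, because it is evaluated by the compiler proper and may therefore use sizeof. The sketch below pairs the two checks for long and long long; `LONG_RANGE_MATCHES_LLONG` and `long_matches_llong` are hypothetical names chosen for illustration. Note that passing both checks only shows that the types have the same range and size; it does not make them compatible types for aliasing purposes.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>

/* Preprocessor view: do long and long long cover the same range? */
#if LONG_MAX == LLONG_MAX && LONG_MIN == LLONG_MIN
#  define LONG_RANGE_MATCHES_LLONG 1
#else
#  define LONG_RANGE_MATCHES_LLONG 0
#endif

/* Compiler view: sizeof is usable here, unlike in #if. Equal ranges with
   different sizes would mean one of the types has padding bits. */
static_assert(!LONG_RANGE_MATCHES_LLONG ||
              sizeof(long) == sizeof(long long),
              "equal ranges but different sizes: padding bits present");

/* Returns 1 if long and long long have the same range and size, else 0. */
int long_matches_llong(void) {
    return LONG_RANGE_MATCHES_LLONG && sizeof(long) == sizeof(long long);
}
```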

This is a C question, but in C++, you could do this in a fancier way with template specialization and the type trait templates.

Davislor
  • The difficulty is that "modern" C compilers will assume that if one function accesses some storage using a pointer of type `long*` and another accesses storage using a `long long*`, the functions can't possibly be accessing the same storage, even if the types have the same layout and representation, and even if it should be obvious to the compiler that aliasing would be likely. – supercat Sep 08 '16 at 22:22
  • @supercat Fair enough, although `void*` might work for that. The correct way to type-pun like this is with a union anyway. – Davislor Sep 09 '16 at 00:06
  • Using `void*` doesn't help, since the problem isn't one of ensuring that compilers allow the syntax, but ensuring that they don't use the aliasing rule to justify assumptions that writes to one pointer won't affect the target of another. – supercat Sep 09 '16 at 14:13
  • I’m pretty sure most compilers can tell that `(int*)p` and `(long*)p` are aliases, but a specific example might help. In a single-threaded program, not intermixing aliases might be your solution, and of course a multi-threaded program sharing this data needs a more robust solution anyway. – Davislor Sep 09 '16 at 17:28