Finding the size of some data in bytes is a common operation.

Contrived example:

char *buffer_size(int x, int y, int chan_count, int chan_size)
{
    size_t buf_size = x * y * chan_count * chan_size;  /* <-- this may overflow! */
    char *buf = malloc(buf_size);
    return buf;
}

The obvious error here is that the int multiplication can overflow (for example, with a 23171x23171 RGBA byte buffer).

What are the rules for promotion when multiplying 3 or more values?
(Multiplying a pair of values is simple)

We could play it safe and just cast:

size_t buf_size = (size_t)x * (size_t)y * (size_t)chan_count * (size_t)chan_size;

Another alternative is to add parentheses to ensure the order of multiplication and promotion is predictable (so that automatic promotion between pairs works as expected) ...

size_t buf_size = (((size_t)x * y) * chan_count) * chan_size;

... which works, but my question is:


Is there a deterministic way to multiply 3 or more values to ensure they will be automatically promoted?
(to avoid overflow)

Or is this undefined behavior?


Notes...

  • Using size_t here won't prevent overflow, it just prevents overflowing the maximum value for that type.
  • In the example given, it would make sense for the arguments to be size_t too, but that's not the point of this question.
ideasman42
  • Consider making all the arguments `size_t`, avoiding a need to cast in the function. If you go with `int` arguments, consider ensuring they're all non-negative if not positive before multiplying -- but then you probably need 3 casts to be safe and 4 to be systematic/consistent. – Jonathan Leffler May 18 '15 at 23:23
  • @Jonathan Leffler, yes, in this example you're right. But this isn't the point of the question: the input may be `int` for any number of reasons, for example because the image format uses ints. – ideasman42 May 18 '15 at 23:24
  • That was the first sentence; it isn't an option. Therefore, you move on to the second sentence. – Jonathan Leffler May 18 '15 at 23:25
  • *"but then you probably need 3 casts"* - I like to be more concise than **probably**. If this is compiler-specific *(undefined)* behavior, that's fine; then it's best to just add in the casts. But if there is a memorable rule in the C spec, it's good to be aware of it so as to flag potential bugs when auditing code. – ideasman42 May 18 '15 at 23:29
  • It depends on the risks you want to take. Do you think there's a chance a compiler could make a wrong decision if you omit some casts? If so, don't risk it. Modern compilers seem to want to go out of their way to break code when there's a chance of integer overflow; I wouldn't risk it. The standard probably says "you should be OK", but I haven't ensured that because I don't like taking the risk. That's why I've made comments, not an answer. It isn't definitive enough to be an answer. If someone comes up with a quote-the-standard answer that makes it clear you're safe (or unsafe), great. – Jonathan Leffler May 18 '15 at 23:32
  • @Jonathan Leffler, understood. This has been my approach too (just add in the casts to avoid possible bugs/ambiguity). But I like to know for sure what's expected, since I often read other developers' code: if they leave out the casts *(and the C spec ensures correct output)*, then it's best not to cause unnecessary work by proposing to **fix** what's not broken. – ideasman42 May 18 '15 at 23:34
  • If you want a general method of rearranging arguments to avoid overflows, there is none, unless you incorporate domain-specific knowledge (e.g. `y` is always greater than `x`, `x` and `y` are in a certain range, etc.). Otherwise, cast to make sure. – Nikos M. May 18 '15 at 23:52
  • Most likely, ISO/IEC 9899:2011 Section 6.5 Expressions, Para 3 applies: _The grouping of operators and operands is indicated by the syntax. 85) Except as specified later, side effects and value computations of subexpressions are unsequenced. 86)_ where the numbers are footnotes (86 is not immediately relevant; 85 is long but might be relevant, but footnotes are not normative). Parenthesized expressions are at the top of the hierarchy. The question is can the optimizer optimize around those -- I don't recall what the rules mean or what the consensus is on the issue. – Jonathan Leffler May 18 '15 at 23:52
  • @JonathanLeffler: The optimizer can rearrange expressions only when it won't make a difference (as usual). Grouping is indicated by the syntax means that `a+b+c` is syntactically equivalent to `((a+b)+c)`, and therefore `(size_t)a + b + c` is syntactically equivalent to `((((size_t)a) + b) + c)`. That is *not* the same as `a + b + (size_t)c`; here, `a+b` will be computed using signed semantics (i.e. overflow is undefined). But it is exactly the same as `(size_t)a + (size_t)b + (size_t)c`. – rici May 19 '15 at 04:54

1 Answer

In C (and C++), the type of an arithmetic operator is determined as follows:

  1. Both operands are converted to the same type, using the "usual arithmetic conversions".

  2. That's the type of the result.

Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions [Note 1] [Note 2]

There is no other rule, so there is no special case for expressions with two or more operators. Each operation is typed independently, according to the syntax.

The result type is not automatically widened in order to avoid or reduce the probability of overflow; the operands are both converted to a common type "which is also the type of the result". So if you multiply two ints, the result will be an int and overflow will result in undefined behaviour. [Note 3]
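This can be checked at compile time. The following is a minimal sketch, assuming a C11 compiler (for _Generic and _Static_assert); the macro TYPE_ID and the function product_type_demo are made-up names for illustration. The expressions inside _Generic are never evaluated, so the checks themselves cannot overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maps the type of an expression to a small tag. */
#define TYPE_ID(e) _Generic((e), int: 1, size_t: 2, default: 0)

size_t product_type_demo(int x, int y)
{
    /* int * int has type int: the result type is not widened. */
    _Static_assert(TYPE_ID(x * y) == 1, "int * int has type int");

    /* One size_t operand is enough to make this product size_t. */
    _Static_assert(TYPE_ID((size_t)x * y) == 2, "size_t * int has type size_t");

    return (size_t)x * y;
}
```

Assigning the result of x * y to a size_t variable does not change any of this: the multiplication has already been done as int by the time the conversion happens.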

The syntax of the language(s) precisely defines how a full expression is grouped, and evaluation is required to conform to the syntax. The expression a + b + c must have the same result as the expression (a + b) + c, because the syntax requires that grouping. The compiler may rearrange the computation as it sees fit, provided it can demonstrate that the result is semantically identical for all valid inputs. But it cannot decide to change the result types of any operators. a + b + c must have the type which results from applying the usual arithmetic conversions to the types of a and b, and then applying them again to that type and the type of c. [Note 4]

The usual arithmetic conversions are detailed in §6.3.1.8 ("Usual arithmetic conversions") of the C standard, and in paragraph 10 of the introduction to §5 (Expressions) of C++. Roughly speaking, it goes like this:

  1. If both operands are floating point, both operands are converted to the wider of the two types; if one operand is floating point, the other is converted to that floating point type.

  2. Otherwise, if both operands are signed integral types, they are both converted to the widest of the two types and int.

  3. Otherwise, if both operands are unsigned integral types at least as large as unsigned int, they are both converted to the wider of the two types.

[Note 5]
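The three rules above can be observed directly. Here is a small sketch, assuming C11 _Generic is available; TYPE_NAME and the function names are hypothetical helpers, not anything from the standard library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: names the type of an expression without evaluating it. */
#define TYPE_NAME(e) _Generic((e),          \
    int: "int", long: "long",               \
    unsigned int: "unsigned int",           \
    unsigned long: "unsigned long",         \
    float: "float", double: "double",       \
    default: "other")

/* Rule 1: a floating-point operand wins, and the wider floating type is used. */
const char *float_plus_double(void) { float f = 1.0f; double d = 1.0; return TYPE_NAME(f + d); }
const char *int_times_double(void)  { int i = 2; double d = 1.5; return TYPE_NAME(i * d); }

/* Rule 2: signed operands go to the widest of the two types and int. */
const char *short_times_short(void) { short a = 2, b = 3; return TYPE_NAME(a * b); }
const char *int_times_long(void)    { int i = 2; long l = 3; return TYPE_NAME(i * l); }

/* Rule 3: unsigned operands go to the wider of the two unsigned types. */
const char *uint_times_ulong(void)  { unsigned int u = 2; unsigned long ul = 3; return TYPE_NAME(u * ul); }
```

Note that short * short yields int, not short: that is the "and int" part of rule 2.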

Now, take the case of a * b * c * d, where a, b, c and d are all int and the desire is to produce a size_t.

Syntactically, that expression is equivalent to (((a * b) * c) * d), and the usual arithmetic conversions are applied accordingly operation by operation. If you convert a to size_t with a cast ((size_t)a * b * c * d), the conversions will be applied as though it were parenthesized. So the operands and the result of (size_t)a * b would be size_t, and therefore so will be the result of (size_t)a * b * c and thus (size_t)a * b * c * d. In other words, all the operands will be converted to unsigned size_t values and all the multiplications will be performed as unsigned size_t multiplications. That's well-defined but probably meaningless if any of the values happen to be negative.
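As a sketch of why the position of the cast matters (again assuming C11; IS_SIZE_T and buf_size_leading_cast are illustrative names, and the parameters are borrowed from the question):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the expression has type size_t, 0 otherwise. */
#define IS_SIZE_T(e) _Generic((e), size_t: 1, default: 0)

size_t buf_size_leading_cast(int x, int y, int chan_count, int chan_size)
{
    /* Grouped as ((((size_t)x * y) * chan_count) * chan_size): every step is size_t. */
    _Static_assert(IS_SIZE_T((size_t)x * y), "first product is size_t");
    _Static_assert(IS_SIZE_T((size_t)x * y * chan_count), "second product is size_t");

    /* A cast placed later would not help the earlier product: x * y alone is int. */
    _Static_assert(!IS_SIZE_T(x * y), "x * y is computed as int");

    return (size_t)x * y * chan_count * chan_size;
}
```

So a single cast on the leftmost operand is sufficient, while something like x * y * (size_t)chan_count * chan_size still computes x * y as int and can overflow there.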

Either the second or the third multiplication could exceed the capacity of a size_t, but since size_t is unsigned, the computation will be performed modulo 2^N, where N is the number of value bits in size_t. The cast, therefore, is not safe in the sense that it avoids overflow, but it does at least avoid undefined behaviour.
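If silent wrap-around is not acceptable (a too-small buffer is hardly better than undefined behaviour), each multiplication can be checked before it is performed. The following is one possible sketch, not the only way to do it; mul_size_checked and image_buf_size are hypothetical helpers, and rejecting negative inputs is an added assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stores a * b in *out and returns true, or returns false if the product would wrap. */
bool mul_size_checked(size_t a, size_t b, size_t *out)
{
    if (a != 0 && b > SIZE_MAX / a)
        return false;
    *out = a * b;
    return true;
}

/* Computes x * y * chan_count * chan_size as size_t, failing on negative input or wrap-around. */
bool image_buf_size(int x, int y, int chan_count, int chan_size, size_t *out)
{
    if (x < 0 || y < 0 || chan_count < 0 || chan_size < 0)
        return false;

    size_t n = (size_t)x;
    return mul_size_checked(n, (size_t)y, &n)
        && mul_size_checked(n, (size_t)chan_count, &n)
        && mul_size_checked(n, (size_t)chan_size, out);
}
```

A caller would then only call malloc when image_buf_size returns true.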


Notes

  1. The quote is from the C++ standard, §5, paragraph 10. The C standard has a slightly more complicated version in §6.3.1.8, because C11 includes complex arithmetic types. For integer (and non-complex floating-point) operands, C and C++ have identical semantics.

  2. The shift operators are exceptions, which is why it says "many binary operators". The result type of a shift operator is precisely the (possibly promoted) type of its left operand, regardless of the type of the right operand. All bitwise operators are restricted to integers, so the part of the "usual arithmetic conversions" which involves floating-point types doesn't apply to those operators.

  3. If you multiply two unsigned ints, the result will be an unsigned int and the computation is defined for all values:

    A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type. (C §6.2.5/9)

  4. Both C and C++ standards are very clear on this point, and include examples to drive it home. In general, neither signed integer nor floating-point operators are associative, so it is probably only possible to regroup and rearrange a computation if that computation involves only unsigned integer arithmetic.

    An example of a case where regrouping of integer arithmetic would be prohibited appears as Example 6 in §5.1.2.3 of the C standard and as paragraph 9 of §1.9 of the C++ standard. (It's the same example.) Suppose we have a machine with 16-bit ints, where signed overflow results in a trap. In that case, a = a + 32760 + b + 5; cannot be rewritten as a = (a + b) + 32765;:

    if the values for a and b were, respectively, −32754 and −15, the sum a + b would produce a trap while the original expression would not;

  5. Those are the simple, untroublesome cases. Normally you should try to avoid the other ones, but for the record:

    a. Before the above happens, if the type of either operand is narrower than int, then that operand will be promoted to either int or unsigned int. Normally, it will be promoted to int, even if it was unsigned. Only if int is not wide enough to represent all values of the type will the operand be promoted to unsigned int. For example, on most architectures an unsigned char operand will be promoted to an int, not an unsigned int. (Although architectures in which char and int are the same width are possible, they are not common.)

    b. Finally, if one type is signed and the other is unsigned, then they will both be converted to:

    • the unsigned type if it is at least as wide as the signed type. (E.g. unsigned int * int => unsigned int)

    • the signed type if it is wide enough to hold all the values of the unsigned type. (E.g. unsigned int * long long => long long if long long is wider than int)

    • the unsigned type corresponding to the signed type if neither of the above cases holds.
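The promotion and mixed-signedness cases in Note 5 can be sketched the same way. This assumes C11; the #if guards skip the checks on unusual platforms where the stated width conditions do not hold, and TYPE_ID and minus_one_times_one_unsigned are illustrative names.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: maps the type of an expression to a small tag. */
#define TYPE_ID(e) _Generic((e), int: 1, unsigned int: 2, long long: 3, default: 0)

#if UCHAR_MAX <= INT_MAX
/* (a) unsigned char is promoted to int when int can hold all of its values. */
_Static_assert(TYPE_ID((unsigned char)1 * (unsigned char)1) == 1,
               "unsigned char * unsigned char is int");
#endif

/* (b) unsigned int * int => unsigned int */
_Static_assert(TYPE_ID(1u * 1) == 2, "unsigned int * int is unsigned int");

#if LLONG_MAX >= UINT_MAX
/* (b) unsigned int * long long => long long when long long can hold every unsigned int. */
_Static_assert(TYPE_ID(1u * 1LL) == 3, "unsigned int * long long is long long");
#endif

/* The int operand is converted to unsigned int before multiplying, so -1 becomes UINT_MAX. */
unsigned int minus_one_times_one_unsigned(void)
{
    int i = -1;
    unsigned int u = 1u;
    return u * i;
}
```

The last function is the reason negative inputs are "probably meaningless" after a cast to an unsigned type: the value is well-defined, but it is a very large positive number.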

rici