21

After reading this question on signed/unsigned compares (they come up every couple of days I'd say):

I wondered: why don't we have proper signed/unsigned comparisons, but instead this horrible mess? Take the output of this small program:

#include <stdio.h>

#define C(T1,T2) \
  { signed   T1 a = -1; \
    unsigned T2 b = 1; \
    printf("(signed %5s)%d < (unsigned %5s)%d = %d\n", #T1, (int)a, #T2, (int)b, (a<b)); }

#define C1(T) printf("%s:%d\n", #T, (int)sizeof(T)); C(T,char); C(T,short); C(T,int); C(T,long);

int main()
{
    C1(char); C1(short); C1(int); C1(long);
    return 0;
}

Compiled with my standard compiler (gcc, 64-bit), I get this:

char:1
(signed  char)-1 < (unsigned  char)1 = 1
(signed  char)-1 < (unsigned short)1 = 1
(signed  char)-1 < (unsigned   int)1 = 0
(signed  char)-1 < (unsigned  long)1 = 0
short:2
(signed short)-1 < (unsigned  char)1 = 1
(signed short)-1 < (unsigned short)1 = 1
(signed short)-1 < (unsigned   int)1 = 0
(signed short)-1 < (unsigned  long)1 = 0
int:4
(signed   int)-1 < (unsigned  char)1 = 1
(signed   int)-1 < (unsigned short)1 = 1
(signed   int)-1 < (unsigned   int)1 = 0
(signed   int)-1 < (unsigned  long)1 = 0
long:8
(signed  long)-1 < (unsigned  char)1 = 1
(signed  long)-1 < (unsigned short)1 = 1
(signed  long)-1 < (unsigned   int)1 = 1
(signed  long)-1 < (unsigned  long)1 = 0

If I compile for a 32-bit target, the result is the same except that:

long:4
(signed  long)-1 < (unsigned   int)1 = 0

The "How?" of all this is easy to find: just go to section 6.3 of the C99 standard, or clause 4 of the C++ standard, and dig up the clauses that describe how the operands are converted to a common type; this can break if the common type reinterprets negative values.

But what about the "Why?"? As we can see, '<' fails in 50% of all cases, and it also depends on the concrete sizes of the types, so it is platform-dependent. Here are some points to consider:

  • The convert & compare process is not really a prime example of the Rule of Least Surprise

  • I don't believe that there is code out there which relies on the proposition that (short)-1 > (unsigned)1 and is not written by terrorists.

  • This is all terrible when you're in C++ with template code, because you need type trait magic to knit a correct "<".


After all, comparing signed and unsigned values of different types is easy to implement:

signed X a < unsigned Y b  ->  (a<(X)0) || ((Z)a<(Z)b), where Z is the common type of X and Y

The pre-check is cheap and can also be optimized away by the compiler if a >= 0 can be proven statically.
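
For illustration, here is a minimal sketch of that recipe as a plain C function (the function name and the concrete int/unsigned type pair are my own choice, not an existing API):

#include <stdbool.h>

/* "signed < unsigned" done mathematically: a negative a is smaller than any
   unsigned b; otherwise both values are non-negative and can safely be
   compared after converting a to the unsigned type. */
static bool less_signed_unsigned(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}

With this helper, less_signed_unsigned(-1, 1u) is true, whereas the plain -1 < 1u evaluates to false after the usual arithmetic conversions.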

So here's my question:

Would it break the language or existing code if we added safe signed/unsigned comparisons to C/C++?

("Would it break the language" means would we need to make massive changes to different parts of the language to accommodate this change)


UPDATE: I ran this on my good old Turbo C++ 3.0 and got this output:

char:1
(signed  char)-1 < (unsigned  char)1 = 0

Why is (signed char)-1 < (unsigned char)1 == 0 here?

Nordic Mainframe
  • Your use of the example `(short)-1 > (unsigned)1` is something of a strawman. While it's unlikely that someone will care about constant comparisons like that, there is plenty of code which depends on the (well-specified) behavior when evaluating more complex expressions. – R.. GitHub STOP HELPING ICE Aug 13 '10 at 12:24

6 Answers

12

My answer is for C only.

There is no type in C that can accommodate all possible values of all possible integer types. The closest C99 comes to this is intmax_t and uintmax_t, and their intersection covers only half of each type's range.

Therefore, you cannot implement a mathematical value comparison such as x <= y by first converting x and y to a common type and then doing a simple operation. This is a major departure from a general principle of how operators work. It also breaks the intuition that operators correspond to things that tend to be single instructions in common hardware.

Even if you added this additional complexity to the language (and extra burden to implementation writers), it wouldn't have very nice properties. For example, x <= y would still not be equivalent to x - y <= 0. If you wanted all these nice properties, you'd have to make arbitrary-sized integers part of the language.
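
To make that concrete, here is a small example of my own (not from the answer) showing that the equivalence already fails today as soon as unsigned arithmetic is involved:

#include <stdio.h>

int main(void)
{
    unsigned x = 0, y = 1;

    printf("%d\n", x <= y);       /* prints 1: 0 <= 1 holds */
    printf("%d\n", x - y <= 0);   /* prints 0: x - y wraps around to UINT_MAX */
    return 0;
}

A "mathematical" <= could fix the first comparison for mixed signs, but x - y would still be computed in a fixed-width type, so the second line could not be fixed without arbitrary-sized integers.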

I'm sure there's plenty of old unix code out there, possibly some running on your machine, that assumes that (int)-1 > (unsigned)1. (Ok, maybe it was written by freedom fighters ;-)

If you want lisp/haskell/python/$favorite_language_with_bignums_built_in, you know where to find it...

Gilles 'SO- stop being evil'
  • +1 for bringing up the still-missing equivalence between `x<=y` and `x-y<=0`. – R.. GitHub STOP HELPING ICE Aug 13 '10 at 12:38
  • Hm. I would have to make the type of x-y signed in all cases where either x or y is signed if I want to save this. I think this counts as 'break the language'. – Nordic Mainframe Aug 13 '10 at 13:09
  • @Luther: you have to make the type of `x-y` big enough to accommodate `UINTMAX_MAX - INTMAX_MIN`, which means it must be one bit bigger than (`u`)`intmax_t`. Therefore, there cannot be a biggest integer type; in other words, the language must have bignums built-in. I wouldn't call the resulting language C. – Gilles 'SO- stop being evil' Aug 13 '10 at 13:36
  • Why? `int1+int2` can't hold `INTMAX_MAX+INTMAX_MAX` either. And almost everyone's happy with ints *that one can't even negate without invoking UB*. So we could either decide to let x-y underflow be UB (trap, wrap-around) or specify under/overflow behavior that actually makes sense. – Nordic Mainframe Aug 13 '10 at 14:20
  • @Luther: limited-size types, mathematically-intuitive `x<=y`, `x<=y` equivalent to `x-y<=0`: pick two. Your proposal keeps 1 and adds 2 to the language, but I don't see much point in 2 without 3. – Gilles 'SO- stop being evil' Aug 13 '10 at 14:35
  • I'm not convinced that *"intuitive equivalences should hold"* really works as an argument in favor of C's way of handling signed/unsigned. For example, `-1<1u` is `0`, but `0<1u+1` is obviously `1`, while `-1-1u<0` is `0` again. – Nordic Mainframe Aug 13 '10 at 14:54
  • As far as the mathematics go, the problem is that naive coders expect `<` to be a mathematical order relation, while the ring they're working in is not the integers but **integers mod 2^n**. There's simply no way to have an order relation that fits the usual algebraic rules (e.g. `a<b` implying `a+c<b+c`). – R.. GitHub STOP HELPING ICE Aug 13 '10 at 15:33
  • *"As far as the mathematics go, the problem is that naive coders expect < to be a mathematical order relation, while the ring they're working in is not the integers but integers mod 2^n."* -- We're working mod 2^n only as long as we stay within unsigned numbers. int *has* this order relation (barring UB). What I'm asking is whether it is really necessary to have a completely inconsistent order (which depends on the size of the operands) when it comes to **mixed**-sign arithmetic. – Nordic Mainframe Aug 13 '10 at 17:06
8

Yes, it would break the language/existing code. The language, as you have noted, carefully specifies the behavior when signed and unsigned operands are used together. This behavior with comparison operators is essential for some important idioms, like:

if (x-'0' < 10U)
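
To spell out why this idiom depends on the mixed comparison, here is a hedged sketch (the function name is my own):

/* One-comparison digit test: x - '0' has type int; for x below '0' the result
   is negative, and the usual arithmetic conversions turn it into a huge
   unsigned value, so the single comparison against 10U also rejects those
   characters. */
static int is_ascii_digit(int x)
{
    return x - '0' < 10U;
}

A "mathematical" < would silently turn this into a test for x < '0' + 10 only, breaking the idiom.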

Not to mention things like (equality comparison):

size_t l = mbrtowc(&wc, s, n, &state);
if (l==-1) ... /* Note that mbrtowc returns (size_t)-1 on failure */
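
For completeness, here is a self-contained variant of that fragment (the buffer contents and error handling are mine; whether this particular byte is rejected depends on the locale - the point is only that comparing the unsigned size_t against -1 is well-defined):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    mbstate_t state = {0};
    wchar_t wc;
    const char s[] = "\xff";   /* possibly an invalid multibyte sequence */

    size_t l = mbrtowc(&wc, s, sizeof s - 1, &state);

    /* -1 is implicitly converted to (size_t)-1, which is exactly the error
       return of mbrtowc, so the equality test does what the idiom intends. */
    if (l == -1)
        puts("encoding error");
    return 0;
}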

As an aside, specifying "natural" behavior for mixed signed/unsigned comparisons would also incur a significant performance penalty, even in programs which are presently using such comparisons in safe ways where they already have their "natural" behavior due to constraints on the input which the compiler would have a hard time determining (or might not be able to determine at all). In writing your own code to handle these tests, I'm sure you've already seen what the performance penalty would look like, and it's not pretty.

R.. GitHub STOP HELPING ICE
  • Another common example: `if (snprintf(buf, sizeof buf, ...) >= sizeof buf)` - this catches both errors (return value of -1) and overflows with a single comparison, due to the fact that `size_t` is unsigned. – R.. GitHub STOP HELPING ICE Aug 13 '10 at 12:27
  • +1: Good examples. I have a bit of a blind spot there because I usually avoid code which assumes an exploitable representation of ints<0 in unsigned. – Nordic Mainframe Aug 13 '10 at 12:43
  • @Luther: converting a negative signed value to an unsigned type is well-defined, and doesn't depend on the representation. – Mike Seymour Aug 13 '10 at 13:45
  • @Michael: The value of (unsigned)(-1) depends on the size of unsigned. Btw, the method given in 6.3.1.3 seems to suggest that (unsigned)-1 has the same bit pattern as -1 in two's-complement arithmetic, doesn't it? – Nordic Mainframe Aug 13 '10 at 14:13
  • @R: I'm not sure if `size_t x ... if (x==-1)` is actually correct. C99 says about `size_t` that it has to be *an unsigned integer type* and should not have *an integer conversion rank greater than signed long*. This seems to suggest that size_t could in fact be, for example, a 16-bit `unsigned short` while `int` is 32-bit. In that case `(x==-1)` would fail, wouldn't it? – Nordic Mainframe Aug 13 '10 at 15:09
  • @Luther: I suppose it would, but there has never been, and will never be, a C implementation where `size_t` is smaller than `int`. While the standard doesn't forbid it, it's completely ridiculous. – R.. GitHub STOP HELPING ICE Aug 13 '10 at 15:27
  • @Luther: `unsigned` is the same size as `int`, so `(unsigned)(-1)` will be `UINT_MAX`. And yes, the conversion is equivalent to reinterpreting the two's-complement bit pattern. You're right about `size_t`; if it is smaller than `int`, then it would be converted to a positive-valued `int` for the comparison. – Mike Seymour Aug 13 '10 at 15:35
  • @R..: As far as I can tell, a freestanding C89 implementation for a microcontroller could make `size_t` an 8-bit `unsigned char` (the only minimum requirement I can find is §2.2.4.1 “32767 bytes in an object (in a hosted environment only)”). In C99, `SIZE_MAX>=65535`, so `size_t` shorter than `unsigned int` would be pathological. – Gilles 'SO- stop being evil' Aug 13 '10 at 16:00
  • @Gilles: it doesn't seem pathological to me to have a 32-bit `int` and a 16-bit `size_t`. I think there have been processors with 32-bit data registers and a 16-bit address bus, where those would be the natural sizes. – Mike Seymour Aug 13 '10 at 17:57
  • The first example is interesting because it relies upon the unsigned thing being compared against being an unsigned type *which is `int` or larger*. That would certainly be true of a constant with a `u` suffix, but not with something like `uint32_t`. – supercat Sep 05 '13 at 20:23
  • Arguably `if (l == -1)` could be `if (!~l)`, which is harder to read at first glance but trivial if it was a common idiom. – user541686 May 15 '14 at 16:23
7

I do not think it would break the language, but yes, it could break some existing code (and the breakage would probably be hard to detect at the compiler level).

There exists a lot more code written in C and C++ than you and I together can imagine (some of it may even be written by terrorists).

Relying on the proposition that (short)-1 > (unsigned)1 may be done unintentionally by someone. There exists a lot of C code dealing with complex bit manipulation and similar things. It is quite possible some programmer is using the current comparison behaviour in such code. (Other people have already provided nice examples of such code, and the code is even simpler than I would expect.)

The current solution is to warn on such comparisons instead and leave the decision to the programmer, which I think is in the spirit of how C and C++ work. Also, solving it at the compiler level would incur a performance penalty, and that is something C and C++ programmers are extremely sensitive about. Two tests instead of one might seem like a minor issue to you, but there is probably plenty of C code where it would matter. The previous behaviour could be forced back e.g. by explicit casts to a common data type - but this again requires programmer attention, so it is no better than a simple warning.
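
For concreteness, a small sketch of that warn-and-cast approach (the flag names assume GCC or Clang, where -Wsign-compare is the relevant warning, enabled e.g. by -Wextra):

/* Compile with e.g. "gcc -Wextra -c example.c": the first function draws a
   signed/unsigned comparison warning, the second does not. */
int f(int a, unsigned b)
{
    return a < b;             /* warned: a is implicitly converted to unsigned */
}

int g(int a, unsigned b)
{
    return (unsigned)a < b;   /* explicit cast keeps today's behaviour and silences the warning */
}

The cast does not change the generated code, but it makes the conversion visible in the source, which is exactly the "programmer attention" mentioned above.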

Suma
  • There's no reason that this couldn't be added to a later spec, with old code continuing to compile as C89 or C99 – Peter Gibson Aug 13 '10 at 12:23
  • Sure there is - the language is not supposed to gratuitously break things from version to version. What do you do when you're using a library that expects one behavior with an application that expects the other? (Especially if part of the library code is macros or inline functions in headers?) Sure there will sometimes be minor breakage, but something as fundamental as this should never be touched! – R.. GitHub STOP HELPING ICE Aug 13 '10 at 12:32
1

I think C++ is like the Roman Empire. It's big, and too established to fix the things that are going to destroy it.

C++0x - and Boost - are examples of horrible, horrible syntax - the kind of baby only its parents can love - and are a long, long way from the simple, elegant (but severely limited) C++ of 10 years ago.

The point is, by the time one has "fixed" something as terribly simple as comparisons of integral types, enough legacy and existing C++ code has been broken that one might as well just call it a new language.

And once broken, there is so much else that is also eligible for retroactive fixing.

Chris Becke
  • The beast is not my kid, and yet I like it... What parts of C++0x do you deem *horrible horrible*? There are some that are awkward... but all languages have those sort of things (just name a language) – David Rodríguez - dribeas Aug 13 '10 at 13:08
  • I refer not to any of the features of C++0x - which are all fantastic and sorely needed. I refer to the fact that C++0x and Boost (and the STL) are implemented on top of a horrid template (and namespace) syntax. As a simple question, why is `::` used as a scope resolution operator? There are virtually no situations where a `.` would cause ambiguity. Or - instead of templates, why didn't C++ just have `types` as first-class variables (or given first-class variable usage), i.e. replace `Array<int> theArray;` with `Array theArray(int);`? – Chris Becke Aug 13 '10 at 13:24
  • I agree that templates could be a lot cleaner, but I don't see why a `::` operator is particularly horrible. In general, I don't really care what individual syntactic tokens look like. Surely that's not what makes the language ugly. – jalf Aug 13 '10 at 14:07
  • `::` is horrible, as is `->`. If the array decomposition rule had been extended to classes, then all class variables would have decomposed to implicit class references. Meaning that `.` could have been used as a universal dereferencing / scope resolution operator. Doing more with less, extended too far, can result in languages that look like lisp. But generally, being able to say `a.b.c.d.e` without caring that a ... e might be a namespace, class name, class instance, class reference or class member is far more elegant than saying `a::b::c.d->e`. imho. – Chris Becke Aug 13 '10 at 15:05
0

The only ways for a language to define rules that can come close to upholding the Principle of Least Surprise at run-time when combining operands of different C-language types would be to either have the compiler forbid implicit type conversions in at least some contexts (shifting the 'surprise' to "why won't this compile?" and making it less likely to cause unexpected bugs down the road), define multiple types for each storage format (e.g. both wrapping and non-wrapping variants of each integer type), or both.

Having multiple types for each storage format, e.g. both wrapping and non-wrapping versions of signed and unsigned 16-bit integers, could allow the compiler to distinguish between "I'm using a 16-bit value here in case it makes things more efficient, but it's never going to exceed the range 0-65535 and I wouldn't care what happened if it did" and "I'm using a 16-bit value which needs to wrap to 65535 if it goes negative". In the latter case, a compiler that used a 32-bit register for such a value would have to mask it after each arithmetic operation, but in the former case the compiler could omit that. With regard to your particular wish, the meaning of a comparison between a non-wrapping signed long and a non-wrapping unsigned long would be clear, and it would be appropriate for a compiler to generate the multi-instruction sequence necessary to make it happen (since conversion of a negative number to a non-wrapping unsigned long would be Undefined Behavior, having the compiler define a behavior for comparison operators on those types would not conflict with anything else that might be specified).

Unfortunately, beyond having the compiler generate warnings for mixed-operand comparisons, I don't really see much that can be done with the C language as it exists without adding new types to it as described above; although I would regard the addition of such new types as an improvement, I wouldn't hold my breath.

supercat
0

If a comparison between integer types compared the actual mathematical values, I'd want the same to happen for comparisons between integer and floating-point. And comparing the exact values of an arbitrary 64-bit integer and an arbitrary double-precision floating-point number is quite difficult. But then the compiler would probably be better at it than me.
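
To give an impression of what "quite difficult" means here, the following is a sketch of my own of an exact x < d test between an int64_t and a double, assuming IEEE-754 binary64 doubles (a real implementation would still need the remaining operators and a considered NaN policy):

#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Exact "x < d" without converting x to double, which could round. */
static bool int64_less_than_double(int64_t x, double d)
{
    if (isnan(d))
        return false;                     /* unordered: treat as "not less" */
    if (d >= 9223372036854775808.0)       /* d >= 2^63, larger than any int64_t */
        return true;
    if (d < -9223372036854775808.0)       /* d < -2^63, smaller than any int64_t */
        return false;

    /* Now -2^63 <= d < 2^63, so trunc(d) fits into int64_t exactly. */
    double ip = trunc(d);
    int64_t di = (int64_t)ip;

    if (x != di)
        return x < di;                    /* different integral parts decide */
    return d > ip;                        /* equal integral parts: x < d iff d has a positive fraction */
}

Even this covers only one operator and one operand order; the mirrored cases and uint64_t against double add more of the same.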

gnasher729