Obviously the standard says nothing about this, but I'm interested more from a practical/historical standpoint: did systems with non-two's-complement arithmetic use a plain `char` type that's unsigned? Otherwise you have potentially all sorts of weirdness, like two representations for the null terminator, and the inability to represent all "byte" values in `char`. Do/did systems this weird really exist?
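To make the feared weirdness concrete, here is a minimal simulation in plain C, runnable on any ordinary machine. The names `oc_is_zero` and `oc_strlen` are hypothetical; they model a string scan that compares byte *values* on a one's-complement system, where the pattern 0xFF would also denote zero:

```c
#include <stdio.h>

/* Simulation only, not real hardware behavior: on a one's-complement
   machine the byte pattern 0xFF ("negative zero") would have the value
   zero, just like 0x00. */
static int oc_is_zero(unsigned char b) {
    return b == 0x00 || b == 0xFF;  /* both patterns decode to value 0 */
}

static size_t oc_strlen(const unsigned char *s) {
    size_t n = 0;
    while (!oc_is_zero(s[n]))
        n++;
    return n;
}

int main(void) {
    unsigned char buf[] = { 'a', 'b', 0xFF, 'c', 0x00 };
    printf("%zu\n", oc_strlen(buf));  /* 2: the scan stops at the "negative zero" */
    return 0;
}
```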

- Note that the weirdness becomes even more serious when you consider that `strcmp` is required to compare bytes as `unsigned char`, but presumably would have to stop upon reaching a null terminator byte (either representation) in either string. – R.. GitHub STOP HELPING ICE May 29 '11 at 00:06
- "Do/did systems this weird really exist?" I think not. – GolezTrol May 29 '11 at 00:10
3 Answers
The null character used to terminate strings could never have two representations. It's defined like so (even in C90):

> A byte with all bits set to 0, called the null character, shall exist in the basic execution character set

So a 'negative zero' on a one's-complement machine wouldn't do.
That said, I really don't know much of anything about non-two's complement C implementations. I used a one's-complement machine way back when in university, but don't remember much about it (and even if I cared about the standard back then, it was before it existed).
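As R.. notes in the comments, `strcmp` is specified to compare bytes as `unsigned char` regardless of whether plain `char` is signed. A quick runnable check of that guarantee:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* strcmp compares as unsigned char, so the byte 0xFF sorts after
       0x01 even on implementations where plain char is signed and
       0xFF would be a negative char value. */
    printf("%d\n", strcmp("\xff", "\x01") > 0);  /* prints 1 */
    return 0;
}
```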

- I agree that it wouldn't arise from string literals or string functions, but it seems that if arithmetic left you with a negative zero and you assigned it into a char array to terminate a string, you could think you had terminated it but actually fail... Thus any code which does that probably has a slight portability flaw... – R.. GitHub STOP HELPING ICE May 29 '11 at 01:26
It's true: for the first 10 or 20 years of commercially produced computers (the 1950s and '60s) there were, apparently, some disagreements about how to represent negative numbers in binary. There were actually three contenders:

- Two's complement, which not only won the war but also drove the others to extinction
- One's complement, where `-x == ~x`
- Sign-magnitude, where `-x == x ^ 0x80000000` (both losing schemes are decoded in the sketch after this list)
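A minimal sketch of the difference, runnable on any ordinary machine (the decoder functions are hypothetical, written only to show the bit patterns each scheme assigns): under both losing schemes there are two byte patterns with the value zero.

```c
#include <stdio.h>
#include <stdint.h>

/* Decode an 8-bit pattern under each historical scheme. */
static int ones_complement_value(uint8_t b) {
    return (b & 0x80) ? -(int)(uint8_t)~b : b;      /* negative: flip all bits */
}
static int sign_magnitude_value(uint8_t b) {
    return (b & 0x80) ? -(int)(b & 0x7F) : b;       /* negative: sign bit + magnitude */
}

int main(void) {
    /* Each scheme has a second, "negative zero" pattern. */
    printf("one's complement: 0x00 -> %d, 0xFF -> %d\n",
           ones_complement_value(0x00), ones_complement_value(0xFF));  /* 0, 0 */
    printf("sign-magnitude:   0x00 -> %d, 0x80 -> %d\n",
           sign_magnitude_value(0x00), sign_magnitude_value(0x80));    /* 0, 0 */
    return 0;
}
```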
I think the last important one's-complement machine was probably the CDC 6600: at the time, the fastest machine on Earth and the immediate predecessor of the first supercomputer.¹
Unfortunately, your question cannot really be answered, not because no one here knows the answer :-) but because the choice never had to be made. This was for two reasons:

1. Two's complement took over simultaneously with byte machines. Byte addressing hit the world with the two's-complement IBM System/360; previous machines had no bytes, only complete words had addresses. Sometimes programmers would pack characters inside these words and sometimes they would just use the whole word. (Word length varied from 12 to 60 bits.)
2. C was not invented until a decade after the byte-machine and two's-complement transition. Item #1 happened in the 1960s; C first appeared on small machines in the 1970s and did not take over the world until the 1980s.
So there simply never was a time when a machine had signed bytes, a C compiler, and something other than a two's-complement data format. The idea of null-terminated strings was probably a repeatedly invented design pattern, thought up by one assembly language programmer after another, but I don't know that it was specified by a compiler until the C era.
In any case, the first actually standardized C ("C89") simply specifies "a byte or code of value zero is appended" and it is clear from the context that they were trying to be number-format independent. So, "+0" is a theoretical answer, but it may never really have existed in practice.
¹ The 6600 was one of the most important machines historically, and not just because it was fast. Designed by Seymour Cray himself, it introduced out-of-order execution and various other elements later collectively called "RISC". Although others have tried to claim credit, Seymour Cray is the real inventor of the RISC architecture. There is no dispute that he invented the supercomputer. It's actually hard to name a past "supercomputer" that he didn't design.

- Are there really no actual C implementations with one's complement or sign-magnitude? Why would the C standard bother to allow them if there were no existing implementations, since they're obviously a major nuisance to care about. – R.. GitHub STOP HELPING ICE May 29 '11 at 01:34
- @R, no implementor would care, sure, due to the nuisance, but standards committees are often not implementors. They would see their task as defining a universal language. C was, after all, *"the portable assembly language",* and I'm sure they imagined crash programs at the mainframe manufacturers to implement C. The real reason is probably this: it wasn't obvious in 1989 just how fast all those million-dollar machines were becoming, not just old, but *"scrap you must pay someone to haul away".* It turned out there was only a fine line between *"expensive room-filling computer"* and *"toxic waste".* – DigitalRoss May 29 '11 at 05:36
- I think that they already ran into the `-0 != 0` problems. As I understand it, you could do things like add `0 + -0` and get a real zero, so it was common to do that before comparing. I suspect that the situation would be somewhat like today with both signed and unsigned types, and `-0` would definitely not be a string terminator. – DigitalRoss May 29 '11 at 05:56
- @R.. - One's-complement hardware was produced by Univac/Unisys well into the 21st century. It would have been rather nasty if the C language committee made it impossible to implement on their machines because of an incompatible string terminator! – Bo Persson May 29 '11 at 06:36
- @Bo: It's never impossible. You can always use two's complement even if the hardware was intended for one's complement, simply by ignoring the signed instructions when you generate machine code and always generating the unsigned instructions. Multiplication and division require some minor patch-up, but basic arithmetic is fine. – R.. GitHub STOP HELPING ICE May 29 '11 at 14:43
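A minimal sketch of the trick R.. describes, expressed in C rather than machine code (the function name is hypothetical, and ordinary 32-bit fixed-width types are assumed): two's-complement addition is just unsigned addition reinterpreted, so a compiler for one's-complement hardware could emit only the unsigned add instruction.

```c
#include <stdio.h>
#include <stdint.h>

/* Two's-complement add done entirely with unsigned operations.
   (The conversion back to int32_t is implementation-defined for
   out-of-range values in C90/C99, but wraps on real machines.) */
static int32_t add_via_unsigned(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);  /* wraps mod 2^32 */
}

int main(void) {
    printf("%d\n", add_via_unsigned(-5, 3));  /* -2 */
    printf("%d\n", add_via_unsigned(-1, 1));  /*  0: a single zero pattern */
    return 0;
}
```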
I believe it would be almost, but not quite, possible for a system to have a one's-complement 'char' type, but there are four problems which cannot all be resolved:

- Every data type must be representable as a sequence of char, such that if all the char values comprising two objects compare identical, the objects in question will be identical.
- Every data type must likewise be representable as a sequence of 'unsigned char' (a sketch of this byte-level view follows the list).
- The unsigned char values into which any data type can be decomposed must form a group whose order is a power of two.
- I don't believe the standard permits a one's-complement machine to special-case the value that would be negative zero and make it behave as something else.
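A minimal illustration of the first two requirements (the helper name `bytes_equal` is hypothetical): any object can be inspected byte by byte as unsigned char, and bytewise-identical objects must be identical.

```c
#include <stdio.h>

/* View any object as a sequence of unsigned char and compare bytewise,
   as requirements #1 and #2 above demand must be possible. */
static int bytes_equal(const void *a, const void *b, size_t n) {
    const unsigned char *pa = a, *pb = b;
    for (size_t i = 0; i < n; i++)
        if (pa[i] != pb[i])
            return 0;
    return 1;
}

int main(void) {
    double x = 1.5, y = 1.5;
    printf("%d\n", bytes_equal(&x, &y, sizeof x));  /* 1: identical bytes */
    return 0;
}
```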
It might be possible to have a machine with a one's-complement or sign-magnitude 'char' type if the only way to get a negative zero were by overlaying some other data type, and if negative zero compared unequal to positive zero; I'm not sure whether that could be standards-compliant or not.
EDIT
BTW, if requirement #2 were relaxed, I wonder what the exact requirements would be when overlaying other data types onto 'char'? Among other things, while the standard makes it abundantly clear that one must be able to perform assignments and comparisons on any 'char' values that may result from overlaying another variable onto a 'char', I don't know that it imposes any requirement that all such values must behave as an arithmetic group. For example, I wonder what the legality would be of a machine in which every memory location was physically stored as 66 bits, with the top two bits indicating whether the value was a 64-bit integer, a 32-bit memory handle plus a 32-bit offset, or a 64-bit double-precision floating-point number? Since the standard allows implementations to do anything they like when an arithmetic computation exceeds the range of a signed type, that would suggest that signed types do not necessarily have to behave as a group.
For most signed types, there's no requirement that the type be unable to represent numbers outside the range specified in limits.h; if limits.h specifies that the minimum "int" is -32767, it would be perfectly legitimate for an implementation to in fact allow a value of -32768, since any program that tried to produce such a value would invoke undefined behavior. The key question would probably be whether it would be legitimate for a 'char' value resulting from the overlay of some other type to yield a value outside the range specified in limits.h. I wonder what the standard says?
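For concreteness, here is a minimal sketch of the kind of overlay being discussed (type punning through a union; the printed values depend on whether plain char is signed on the implementation at hand):

```c
#include <stdio.h>
#include <limits.h>

/* Overlay an int onto a char array and read its storage back through
   the char member. */
union overlay {
    int  i;
    char c[sizeof(int)];
};

int main(void) {
    union overlay u;
    u.i = -1;  /* all bits set on a two's-complement machine */
    for (size_t j = 0; j < sizeof u.c; j++)
        printf("byte %zu: %d\n", j, u.c[j]);  /* negative if char is signed */
    printf("CHAR_MIN=%d CHAR_MAX=%d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}
```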

- Where do you get both 1 and 2? I was only aware of 2. – R.. GitHub STOP HELPING ICE May 29 '11 at 01:27
- @R.. - It is not an equivalence. If the char representation is identical, the values are the same. But if there are padding bits, they might be equal even if the char representation is different. – Bo Persson May 29 '11 at 06:41
- Character types, at least `unsigned char`, by definition cannot have padding bits. – R.. GitHub STOP HELPING ICE May 29 '11 at 14:41
- @R.: I'll admit I don't know whether the standard actually specifies (2), but I've seen enough implementations that implicitly rely upon it (e.g. memory-allocation functions that return 'unsigned char') that I'd inferred it was true. Dangerous, I'll admit. I'll add an addendum to my answer. BTW, as for padding bits, they are allowed if and only if there would be no way for a standards-compliant program to be aware of their existence. For example, when C code runs on an original IBM PC or AT, every byte has an extra parity bit in hardware; it's possible to do some tricks in machine code... – supercat May 29 '11 at 18:32
- @R.: ...to deliberately mis-set the parity data for some bytes (and use the NMI that would occur when such bytes are read as a means of trapping reads of uninitialized data), but there would be no way for a standards-compliant program to explicitly control such bits or detect their existence. – supercat May 29 '11 at 18:36
- I was talking about the C definition of padding bits, not the hardware definition. The C definition basically amounts to "bits that are visible only through the representation but not the value". And the standard definitely specifies (2) (in 6.2.6), but I'm doubtful that it also specifies (1). If so, it would be almost impossible to make an implementation where plain `char` is signed and not two's complement, I believe. – R.. GitHub STOP HELPING ICE May 29 '11 at 19:09
- @R.: I'd thought things like malloc() were originally 'char', in the oldest days of C? Perhaps I was mistaken. In any case, if the standard specifies that all unsigned char values that could result from the overlay of other data must be part of an arithmetic group of order 2^N for some N, that would seem to pretty well preclude what would otherwise be other practical platforms for C. Something of an interesting philosophical debate, perhaps, as to whether that should be a requirement. It would totally preclude accurate garbage collection in interactive systems... – supercat May 29 '11 at 19:18
- @R.: ...since the accessibility of pointers could depend upon whether an operator who was shown some numbers wrote them down while they were on the screen. That's getting off-topic, though. – supercat May 29 '11 at 19:20
- Indeed, the C language essentially precludes garbage collection for exactly the reason you cited, among many others. It doesn't matter whether it's interactive. Keep in mind you're also free to encrypt or hash pointers, or spread their bits out all over the place and reassemble them later. – R.. GitHub STOP HELPING ICE May 29 '11 at 19:50
- Note that the only reason I say "essentially" and not "absolutely" is that C specifies a gigantic finite state machine, not a Turing-equivalent computation environment. Thus, theoretically, the compiler could simply run the program with all possible inputs, determine which halt and which go into infinite loops, and "garbage collect" everything not essential to determining the final output. As long as you ignore this theoretical possibility of a compiler that outlives the physical universe by unimaginably large orders of magnitude, garbage collection with C is completely impossible. – R.. GitHub STOP HELPING ICE May 29 '11 at 19:55
- @R.: For a non-interactive environment, it would be theoretically possible, albeit intractable, for a computer to determine whether some sequence of calculations could "reconstitute" a pointer; interactive environments push it from an intractable problem to an impossible one, unless one wants to regard every pointer that has ever been output as a permanent memory leak. Personally, I like the concept of an environment where memory references are regarded as a distinct type which cannot be deconstituted and reconstituted. Actually, I'd like some hardware support... – supercat May 29 '11 at 22:25
- @R.: ...for a relocatable object reference type with optional offset; such a type would make it possible to have finer control of memory-sequencing semantics than is possible with memory barriers. Since the cost of memory barriers increases with the number of cores in a system, I would expect finer-grained control will, if anything, become more essential with time. – supercat May 29 '11 at 22:28