
The ANSI X3.159-1989 "Programming Language C" standard states in the chapter "5.2.1.2 - Multibyte characters" that:

For both [source and execution] character sets the following shall hold:

  • A byte with all bits zero shall be interpreted as a null character independent of shift state.
  • A byte with all bits zero shall not occur in the second or subsequent bytes of a multibyte character.

Does this mean that the following statements are true for the translation and execution environments?

  1. Both the source and execution character sets might have a multibyte value that represents the null character in each shift state. [Thoughts: if the translation or execution environment can switch between shift states (which can change the number of bytes used to represent a character), then it must somehow detect the null character, not only as the one-byte "null character" from the basic character set but also as, for example, a two-byte "null character" in a particular shift state.] P.S. This might be a misconception about how the translation and execution environments interpret character values in string literals and the like.
  2. Those characters can be represented only as values whose first byte is "0" [i.e. a first byte with all bits zero], so there is a wide range of possible representations: "FFFF 0000", "ABCD 0000", etc.
  3. The "null character" is defined only in the basic execution character set. Both rules in the quote above apply to both the extended source and execution character sets. Thus a multibyte representation of the "null character" can exist in both the translation and execution environments, and it is possible to use a multibyte "null character" in source code without escape sequences, by writing that character directly in some kind of literal.

Or can the "null character" be represented only as a single-byte value, so that it is the one and only such character, defined by the basic execution character set?

Eric Postpischil
CoSalamander
  • Just adding [C11 5.2.1.2 in HTML](https://port70.net/~nsz/c/c11/n1570.html#5.2.1.2). – pmg Jun 01 '22 at 15:09
  • @pmg, the question is about C89, in case C11 differs on the statements above. – CoSalamander Jun 01 '22 at 15:17
  • Just adding [C89 2.2.1.2 in HTML](https://port70.net/~nsz/c/c89/c89-draft.html#2.2.1.2). ... and I note the original question refers to paragraph 5.2.1.2 which "my" copy of C89 does not include :-) – pmg Jun 01 '22 at 16:09
  • @pmg I should have noted that I'm reading ISO/IEC 9899:1990, which is referred to as C90, but as far as I can see the difference is not big. Thank you for the link. – CoSalamander Jun 01 '22 at 17:04

1 Answer


Does this mean that the following statements are true for the translation and execution environments?

Both the source and execution character sets might have a multibyte value that represents the null character in each shift state.

No. "null character" is a defined term:

A byte with all bits set to 0, called the null character, shall exist in the basic execution character set [...]

In the current standard (C17) that's in paragraph 5.2.1/2, but identical text goes all the way back to C89.

The point of the provisions quoted in the question is that C implementations don't have to care about shift state or extended characters to recognize null characters, and that using a null character as a string terminator does not cause truncation of any multibyte character.

Those characters can be represented only as values whose first byte is "0" [i.e. a first byte with all bits zero], so there is a wide range of possible representations: "FFFF 0000", "ABCD 0000", etc.

No. Again, for the purposes of the language spec, "null character" is a defined term meaning a byte with value 0. The point of the provisions under discussion is that implementations don't need to consider any broader context when attempting to identify a null character. For example, string functions such as strcpy() and strlen() don't need to know or care anything about character encoding, shift state, or multibyte characters. They just recognize the end of the string by a null character.

The "null character" is defined only in the basic execution character set.

The C specification does not require the source character set to have a null character, but the text you quoted says that if it includes a single-byte character with value 0, then that character is a null character for C's purposes.

Both rules in the quote above apply to both the extended source and execution character sets.

Yes.

Thus a multibyte representation of the "null character" can exist in both the translation and execution environments, [...]

No. Again, a null character is a byte with value 0, regardless of character set or encoding.

Or can the "null character" be represented only as a single-byte value, so that it is the one and only such character, defined by the basic execution character set?

There can be a null character in the source character set, too, though it is not required. And every extended character set embeds the corresponding basic character set, so in that sense every extended execution character set defines the null character, and extended source character sets may do so as well. However, in every character set that includes the null character, that character is represented as a byte with value zero, and in every character set that contains a byte with value zero in any character representation, that byte represents the null character.

John Bollinger
  • Can there be a multibyte representation of the null character in addition to the all-bits-zero byte? – Eric Postpischil Jun 01 '22 at 16:36
  • @EricPostpischil, your question presumes a different definition of "null character" than the one used by the language spec. The spec uses the term with respect to character representation, and in that sense "multibyte representation of the null character" is not meaningful. However, inasmuch as you seem to be asking about a character code of zero encoded with a multibyte representation, yes, it is conceivable that such a representation could satisfy C's restrictions. – John Bollinger Jun 01 '22 at 16:55
  • Your answer gives me a good explanation of what the null character is. However, I don't understand two things: 1. If C implementations don't have to care about extended characters to recognize the null character, how can it be recognized then? [Thoughts: the environment scans the string until it finds a matching code in the encoding in use, starting with one byte and widening its view to two bytes and so on, so the character can't be misinterpreted.] 2. What is the reason to allow the first byte [equivalent to the low-order byte?] to be zero? – CoSalamander Jun 01 '22 at 16:59
  • @CoSalamander, (1) the null character is a byte with value 0. This is visible directly in memory. There is no special requirement to be able to recognize it. (2) The "first" byte is first in memory order. It has nothing to do with the place value of the byte when interpreted as part of a multi-byte number. No special provision need be expressed for that because interpreting a null character at such a position as a string terminator cannot truncate a multibyte character in the middle, as would happen if a null character appeared at another position within one. – John Bollinger Jun 01 '22 at 18:05
  • To clarify, @CoSalamander, in practice multibyte characters with first byte 0 cannot appear *semantically* in C strings, because that first-position null character will be interpreted as a string terminator. That makes it unuseful for an extended character set to define such characters. I am uncertain why the spec does not outright ban them, as it does multibyte characters containing null characters at other positions, but it doesn't. – John Bollinger Jun 01 '22 at 18:30
  • @JohnBollinger, as for "There is no special requirement to be able to recognize [the null character]", do I understand you correctly that the way the null character is recognized in a sequence of bytes depends on the realization of the translation and execution environments? As for your statement "I am uncertain why the spec does not outright ban them...", C11 (or some earlier standard) declares that "*Such a byte shall not occur as part of any other multibyte character*", which might prevent unexpected null-termination of a string in some realizations. – CoSalamander Jun 02 '22 at 09:34
  • @CoSalamander, "the realization of the translation and execution environments" covers pretty much everything there is, so yes, recognition of null characters depends on that. But it *does not* depend on the choice of source or execution *character set*, which in the terminology of the C language spec is inclusive of considerations of what would be called "character encoding" in some other domains. – John Bollinger Jun 02 '22 at 13:08