6

Let's consider the following quote from the C++11 standard (the N3376 draft, to be precise):

(2.14.8.5)

If L is a user-defined-string-literal, let str be the literal without its ud-suffix and let len be the number of code units in str (i.e., its length excluding the terminating null character). The literal L is treated as a call of the form

     operator "" X (str , len )

For all the other types of user-defined literals (floating-point, integer, character), however, the length is never passed along, even if the literal itself is passed as a string. For example:

42_zzz; // calls operator "" _zzz("42") and not operator "" _zzz("42", 2)
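To make the call form above concrete, here is a minimal sketch of a raw literal operator. The helper `count_chars` and the choice to return the character count are assumptions made purely for illustration; the only point is that the operator receives the spelling of the literal as a null-terminated string and nothing else.

```cpp
#include <cstddef>

// Hypothetical helper: counts characters up to the terminating null
// (single-return recursion so that it is a valid C++11 constexpr function).
constexpr std::size_t count_chars(const char* s) {
    return *s == '\0' ? 0 : 1 + count_chars(s + 1);
}

// Raw literal operator for numeric literals: it is handed the spelling of the
// literal without its suffix, as a null-terminated string, with no length.
// (Spelled without a space after "", the form current compilers prefer.)
constexpr std::size_t operator""_zzz(const char* spelling) {
    return count_chars(spelling);
}

static_assert(42_zzz == 2, "the operator is called with \"42\"");
static_assert(3.14_zzz == 4, "the operator is called with \"3.14\"");
```

Because a numeric literal cannot contain a null character in its spelling, scanning for the terminator is enough to recover the length here.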

Why is there this distinction between string and non-string user-defined literals? Or, should I say, why does the implementation pass len for UD string literals? The length, just as in the case of the other literals, could be deduced from the null terminator. What am I missing?

Armen Tsirunyan
  • Probably something to do with encodings/character sets. The other paragraphs before that one all have "[ Note: The sequence c1c2 ...ck can only contain characters from the basic source character set. — end note ]". – Mat Oct 28 '12 at 20:00
  • @Mat: But strings with other encoding or character sets are still null-terminated, aren't they? – Armen Tsirunyan Oct 28 '12 at 20:02
  • Null-termination's not enough. I guess the "basic source character set" doesn't include `\0`. – Mat Oct 28 '12 at 20:08

2 Answers

8

For a string literal it is reasonably conceivable that a null character is embedded in the sequence of the string, e.g., "a\0b". To allow the implementation to consume the entire string literal, even if there is an embedded null character, it needs to know the length of the literal. The other forms for user-defined literals cannot contain embedded zero characters.
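A small sketch of the difference, assuming two hypothetical suffixes `_len` and `_scan` and a hypothetical helper `scan_to_null` (none of these come from the standard): one operator trusts the `len` argument supplied by the compiler, the other ignores it and looks for the first null character the way `strlen` would.

```cpp
#include <cstddef>

// Hypothetical helper: a strlen-style scan that stops at the first null.
constexpr std::size_t scan_to_null(const char* s) {
    return *s == '\0' ? 0 : 1 + scan_to_null(s + 1);
}

// String literal operator that uses the length supplied by the compiler.
constexpr std::size_t operator""_len(const char*, std::size_t len) {
    return len;
}

// String literal operator that ignores len and scans for a null instead.
constexpr std::size_t operator""_scan(const char* str, std::size_t) {
    return scan_to_null(str);
}

// "a\0b" holds three characters: 'a', an embedded null, and 'b'.
static_assert("a\0b"_len == 3, "len covers the whole literal");
static_assert("a\0b"_scan == 1, "the scan stops at the embedded null");
static_assert("abc"_len == "abc"_scan, "no embedded null: both agree");
```

Without the `len` argument, an operator would have no way to see the `b` after the embedded null.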

Dietmar Kühl
  • Incidentally, it's possible to define a macro even in C99 which when invoked with an identifier and a string literal will create a compile-time constant structure with that name holding the string's length followed by an array which contains the string's text but not the trailing null (not sure if it can compile clean under C11). Not sure if such a thing would be possible with a user-defined-string-literal type in C++, but it would seem handy if so. – supercat May 01 '15 at 17:50
  • @supercat: I'm not disputing that you *can* determine the length of a string literal. However, if you only passed a `char const*` you can't determine the length of the string literal! The conventional way to determine the size by finding a null character only determines the size of the string up to the first null character. Somehow the size of the string literal is needed, which is what the macro you describe also depends on: it just uses `sizeof(literal)-1` to determine the number of characters in the literal (excluding the trailing `\0`). – Dietmar Kühl May 01 '15 at 21:54
  • Of course, the size of the string is needed, which is why the structure I mentioned puts it before the string; my point was that even in a C macro one can use the length of a literal string as an integer constant. Incidentally, my code used different macros that generate different structures based upon whether the string is 0-63 bytes, 0-2047, or 0-16777215 [using a 1, 2, or 4 byte prefix]. There are also macros to initialize bounds-checked string buffers with one, two, and four-byte prefixes. String handling methods auto-detect the prefix type, and can also... – supercat May 01 '15 at 22:01
  • ...handle a special "indirect flag" prefix byte followed by a structure describing the string. Code wanting to pass an entire string to a method can pass a direct pointer; code wanting to pass a portion can create a structure describing that portion and pass a pointer to that. Safe bounds-checked string handling without having to manually keep track of string length. – supercat May 01 '15 at 22:05
6

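String literals are always null-terminated in C/C++, but that does not mean they cannot contain an embedded \0 character. You may have "1234\0005678" (the null is written as \000 because an octal escape absorbs up to three octal digits, so "1234\05678" would be read as \056 followed by 78), and while this string is null-terminated, it contains an extra '\0' in it.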
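The layout of such a literal can be checked at compile time. This is only a sketch: `literal_length` and `first_null` are hypothetical helpers, and the check on `'.'` assumes an ASCII-compatible execution character set. It also shows one detail worth keeping in mind when writing an embedded null followed by digits: an octal escape takes up to three octal digits.

```cpp
#include <cstddef>

// Hypothetical helper: number of characters in a string literal, excluding
// the terminating null, taken from the array type rather than from a scan.
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N]) {
    return N - 1;
}

// Hypothetical helper: index of the first null character in the string.
constexpr std::size_t first_null(const char* s) {
    return *s == '\0' ? 0 : 1 + first_null(s + 1);
}

// \000 is a complete three-digit octal escape, so 5678 follows the null.
static_assert(literal_length("1234\0005678") == 9, "nine characters in total");
static_assert(first_null("1234\0005678") == 4, "embedded null at index 4");

// \056 is read as one escape (octal 56), so this literal has no embedded null.
static_assert(literal_length("1234\05678") == 7, "seven characters in total");
static_assert(first_null("1234\05678") == 7, "only the terminator is null");
static_assert("1234\05678"[4] == '.', "octal 56 is '.' in ASCII");
```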

BigBoss