How to properly add hex escapes into a string-literal?

Question

In C, you can embed a hex escape directly inside a string literal.

char str[] = "abcde"; // 'a', 'b', 'c', 'd', 'e', 0x00
char str2[] = "abc\x12\x34"; // 'a', 'b', 'c', 0x12, 0x34, 0x00

Both arrays occupy 6 bytes in memory. The problem arises when the character right after a hex escape is itself a hex digit ([a-fA-F0-9]).

// I want: 'a', 'b', 'c', 0x12, 'e', 0x00
// Error: the trailing 'e' is treated as part of the hex escape, which becomes 0x12e and is out of range
char problem[] = "abc\x12e";

A possible workaround is to patch the array after its definition.

// This works, but it is a bad idea
char solution[6] = "abcde";
solution[3] = 0x12;

This works, but it fails as soon as the array is declared const.

//This will not work
const char solution[6] = "abcde";
solution[3] = 0x12; //Compilation error!

How do I properly insert an 'e' after \x12 without triggering the error?

Why am I asking? When you build a UTF-8 string as a constant, you have to use hex values for any character that lies outside the ASCII table.

Duplicate: https://stackoverflow.com/questions/35180528/limit-the-length-of-a-hexadecimal-escape-sequence-in-a-c-string. I'll close that one as I think the answers posted here are more complete, with the standard quoted inside the answer rather than in comments. — Lundin, Aug 10 '17 at 12:41

user694733 · Accepted Answer · 2017-08-10T11:59:59.023

Use 3 octal digits:

char problem[] = "abc\022e";

or split your string:

char problem[] = "abc\x12" "e";

Why these work:

Unlike hex escapes, which consume every hex digit that follows, the standard limits an octal escape to at most 3 digits.

6.4.4.4 Character constants

...

octal-escape-sequence:
    \ octal-digit
    \ octal-digit octal-digit
    \ octal-digit octal-digit octal-digit

...

hexadecimal-escape-sequence:
    \x hexadecimal-digit
    hexadecimal-escape-sequence hexadecimal-digit
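
As a minimal sketch of the three-digit cap (the array and function names below are made up for illustration, and the second literal is an extra example that is not part of the original answer), the following checks the resulting byte layout:

```c
#include <assert.h>

/* An octal escape ends after at most three digits, so the 'e' remains a
   separate character: 'a', 'b', 'c', 0x12, 'e', 0x00. */
static const char octal_form[] = "abc\022e";

/* A fourth digit is not absorbed either: \123 is one character
   (octal 123), followed by an ordinary '4'. */
static const char octal_then_digit[] = "\1234";

/* Hypothetical helper: aborts via assert() if a layout is not as described. */
void check_octal_layout(void)
{
    assert(sizeof octal_form == 6);
    assert(octal_form[3] == 0x12);
    assert(octal_form[4] == 'e');

    assert(sizeof octal_then_digit == 3);
    assert(octal_then_digit[0] == 0123);
    assert(octal_then_digit[1] == '4');
}
```

Calling check_octal_layout() once from main is enough to exercise it. Note that the escape must be written with all three digits ("\022", not "\22") if the next character could be an octal digit.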

String literal concatenation happens in a later translation phase than escape sequence conversion.

5.1.1.2 Translation phases

...
5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
6. Adjacent string literal tokens are concatenated.
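
The phase ordering can be checked with a short sketch. The names split_form, expected_bytes and check_split_layout are illustrative only; the comparison target is simply the byte sequence the question asks for:

```c
#include <assert.h>
#include <string.h>

/* \x12 is converted while "abc\x12" is still its own token; "e" is
   appended only afterwards, so it cannot extend the escape. */
static const char split_form[] = "abc\x12" "e";

/* The layout the question asks for, written out byte by byte. */
static const char expected_bytes[] = { 'a', 'b', 'c', 0x12, 'e', '\0' };

/* Hypothetical helper: aborts via assert() if the two arrays differ. */
void check_split_layout(void)
{
    assert(sizeof split_form == sizeof expected_bytes);
    assert(memcmp(split_form, expected_bytes, sizeof expected_bytes) == 0);
}
```

If concatenation happened before escape conversion, the first literal would be equivalent to "abc\x12e" and would not translate cleanly, let alone match the six expected bytes.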

A third alternative is to do everything explicitly: `char solution[] = {'a', 'b', 'c', 0x12, 'e', '\0'};` — Lundin, Aug 10 '17 at 12:38
Or even offset the escapes "string" altogether. `"abc" "\x12" "e";` for clarity. — chux - Reinstate Monica, Aug 10 '17 at 12:38

score 28 · Answer 2 · edited Nov 22 '22 at 13:37

Since string literals are concatenated early on in the compilation process, but after the escaped-character conversion, you can just use:

char problem[] = "abc\x12" "e";

though you may prefer full separation for readability:

char problem[] = "abc" "\x12" "e";

For the language lawyers amongst us, this is covered in C11 5.1.1.2 Translation phases, specifically phases 5 and 6:

Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

Adjacent string literal tokens are concatenated.
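
Because the whole value is fixed at translation time, the split form also works for the const array from the question. One possible way to keep the escape isolated, sketched here with a hypothetical macro name (DC2, after the ASCII name of control character 0x12), is to give it its own token:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro: the escape lives in its own string literal token,
   so whatever follows it can never be read as extra hex digits. */
#define DC2 "\x12"

/* No later assignment is needed, so const is not a problem. */
static const char problem[] = "abc" DC2 "e";

/* Hypothetical helper: aborts via assert() if the layout is unexpected. */
void check_const_layout(void)
{
    assert(strlen(problem) == 5);
    assert(problem[3] == 0x12);
    assert(problem[4] == 'e');
}
```

Whether a macro is clearer than writing "\x12" inline is a matter of taste; it mainly helps when the same control byte appears in many literals.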

score 7 · Answer 3 · answered Aug 10 '17 at 13:06

Why am I asking? When you build a UTF-8 string as a constant, you have to use hex values for any character that lies outside the ASCII table.

Well, no. You don't have to. As of C11, you can prefix the string literal with u8, which tells the compiler to encode the string as UTF-8.

char solution[] = u8"no need to use hex-codes áé§µ";

(Same thing is supported by C++11 as well, by the way)
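
A small sketch comparing the two approaches, assuming a C11 (or later) compiler. It uses the universal character name \u00e9 rather than a literal é so that it does not depend on the encoding of the source file; the names hex_form, u8_form and check_utf8_forms are illustrative:

```c
#include <assert.h>
#include <string.h>

/* C99-compatible: the UTF-8 bytes of U+00E9 written as hex escapes. The
   literal is split so the following 'e' is not absorbed into \xa9. */
static const char hex_form[] = "n\xc3\xa9" "e";

/* C11: u8 prefix plus a universal character name. \u takes exactly four
   hex digits, so the trailing 'e' is unambiguous here. */
static const char u8_form[] = u8"n\u00e9e";

/* Hypothetical helper: aborts via assert() if the encodings differ. */
void check_utf8_forms(void)
{
    assert(sizeof hex_form == 5);   /* 'n', 0xC3, 0xA9, 'e', 0x00 */
    assert(sizeof u8_form == sizeof hex_form);
    assert(memcmp(hex_form, u8_form, sizeof hex_form) == 0);
}
```

On a C99-only toolchain the u8 line will not compile, which leaves the hex-escape form with a split literal as the portable option.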

Damon

Well yes. I'm on C99. Thanks for the clue. – unalignedmemoryaccess Aug 10 '17 at 13:08
People might be fishing for the non-printable characters 0 to 31 of the classic 7 bit ASCII table. – Lundin Aug 10 '17 at 13:22
@Lundin Shouldn't they rather omit character 0...? – CiaPan Aug 10 '17 at 21:05
The standard doesn't require that unicode characters be supported in source code – M.M Aug 15 '17 at 01:07
@M.M The C++ standard (question is C, but they're the same, and I have the C++ standard ready while I would need to search for the C one) says in 2.14.5/6: _"is initialized with the given characters as encoded in UTF-8"_. How would that work if it isn't at the same time required to support these characters? Also, you may even use _universal-character-name_ class as defined in TR 10176:2003 **for identifiers** (per n3146 there's some refinements), which basically means most unicode characters except half-width, combining, and punctuation. This is a _minimum required_ set, not optional. – Damon Aug 15 '17 at 09:02
@Damon C and C++ are different languages. The C standard isn't the same (I searched it for your text and it did not come up). You can use universal character constants, e.g. `u8"\u12345678"`, because backslash, `u`, `1` etc. are in the source character set; however the denoted unicode character might not be in the source character set. The source code is only allowed to contain characters from the source character set, which could be 7-bit ASCII for example (this is unrelated to the execution character set). – M.M Aug 15 '17 at 09:24
@M.M One of the last (_the_ last?) draft before the standard which is publicly available on the net is [n1570](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf) 6.4.2.1, page 59, note how _universal-character-name_ belongs to _identifier-nondigit_. 6.4.5/3, page 70 says: _"sequence of zero or more multibyte characters"_. C and C++ _are the same_, with minuscule differences, as far as this detail is concerned (I'm aware that they are generally different languages). Different wording, same effect. – Damon Aug 15 '17 at 09:30
@Damon this is a C question, and the C Standard exists, there is literally no reason to involve C++ – M.M Aug 15 '17 at 10:21
@M.M: Then please just care to read the standard (or the freely available draft, link provided above) to see that you are wrong. I'm not much interested in further discussing attempts at finding a fly in the ointment when there is no fly. – Damon Aug 15 '17 at 11:25
I already did and explained in my previous comment. universal-character-name is allowed because all of the characters comprising such a thing are in the basic source character set, but unicode characters may not be. It's implementation-defined what is in the *extended source character set* and that could be empty. – M.M Aug 15 '17 at 12:32
@M.M Sigh... My last attempt. The above quote on page 69 refers to Annex D which _"lists the hexadecimal code values that are valid in [...] identifiers"_. Under 1) you have ASCII, under 2) you have roman extended, and under 3) to 8) is "most of Asia except half-width". You are explicitly allowed to use the characters within those ranges in identifiers. Fullstop. That means, consequentially, a compiler **must** support source code that contains them, there is no other way it could be. – Damon Aug 15 '17 at 14:00
Annex D is saying that you can have `foo\u00003040` for example as an identifier, but not `foo\u00000300`. The implementation doesn't have to support the corresponding unicode character in the source in either case. You already quoted the grammar for identifiers, which included *universal-character-name* , which means a `\u` hex code, not the actual character. – M.M Aug 15 '17 at 14:06
Sorry this is a bad answer. The question is obviously for C, which only supports `"..."` and `L"..."` strings, not `u8"..."` strings. – Daniel Dec 21 '20 at 07:25
