4

Here is an interesting oddity about Python's repr:

The tab character \x09 is represented as \t. However this convention does not apply for the null terminator.

Why is \x00 represented as \x00, rather than \0?

Sample code:

# Some facts to make sure we are on the same page
>>> '\x31' == '1'
True
>>> '\x09' == '\t'
True
>>> '\x00' == '\0'
True

>>> x = '\x31'
>>> y = '\x09'
>>> z = '\x00'
>>> x
'1' # As Expected
>>> y
'\t' # Okay
>>> z
'\x00' # Inconsistent - why is this not \0
Andrei Cioara
  • 3,404
  • 5
  • 34
  • 62
  • 1
    Because it's just not commonplace enough. `\0` is not the only such escape that's not used in `repr()`. – Martijn Pieters Oct 19 '18 at 16:43
  • @MartijnPieters I think the question might still be rephrased as "Where in the Python library is '\x09' translated to '\t'?"... because then you could check what is and isn't converted. – mVChr Oct 19 '18 at 16:50
  • See [Is asking "why" questions on language specifications still considered primarily opinion-based?](https://meta.stackoverflow.com/questions/323334/is-asking-why-on-language-specifications-still-considered-as-primarily-opinio) on [meta]. Granted, this is an implementation decision, but the same considerations apply. – Charles Duffy Oct 19 '18 at 17:00

2 Answers2

6

The short answer: because that's not a specific escape that is used. String representations only use the single-character escapes \\, \n, \r, \t, (plus \' when both " and ' characters are present) because there are explicit tests for those.

The rest is either considered printable and included as-is, or included using a longer escape sequence (depending on the Python version and string type, \xhh, \uhhhh and \Uhhhhhhhh, always using the shortest of the 3 options that'll fit the value).

Moreover, when generating the repr() output, for a string consisting of a null byte followed by a digit from '1' through to '7' (so bytes([0x00, 0x49]), or bytes([0x00, 0x4A]), etc), you can't just use \0 in the output without then also having to escape the following digit. '\01' is a single octal escape sequence, and not the same value as '\x001', which is two bytes. While forcing the output to always use three octal digits (e.g. '\0001') could be a work-around, it is just simpler to stick to a standardised, simpler escape sequence format. Scanning ahead to see if the next character is an octal digit and switching output styles would just produce confusing output (imagine the question on SO: What is the difference between '\x001' and '\0Ol'?)

The output is always consistent. Apart from the single quote (which can appear either with ' or \', depending on the presence of " characters), Python will always use same escape sequence style for a given codepoint.

If you want to study the code that produces the output, you can find the Python 3 str.__repr__ implementation in the Objects/unicodeobject.c unicode_repr() function, which uses

/* Escape quotes and backslashes */
if ((ch == quote) || (ch == '\\')) {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, ch);
    continue;
}


/* Map special whitespace to '\t', \n', '\r' */
if (ch == '\t') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 't');
}
else if (ch == '\n') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'n');
}
else if (ch == '\r') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'r');
}

for single-character escapes, followed by additional checks longer escapes below. For Python 2, a similar but shorter PyString_Repr() function does much the same thing.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
4

If it tried to use \0, then it would have to special-case when numbers immediately followed it, to prevent them from being interpreted as an octal literal. Always using \x00 is simpler and always correct.