"surrogateescape" cannot escape certain characters

Question

Regarding reading and writing text files in Python, one of the main Python contributors mentions this regarding the surrogateescape Unicode Error Handler:

[surrogateescape] handles decoding errors by squirreling the data away in a little used part of the Unicode code point space. When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly.

However, opening a file with this handler and then writing its lines to another file:

input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')

for line in input_file:
    output_file.write(line)

Results in:

  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed

Note that the input file is not ASCII. However, the script traverses hundreds of lines that contain non-ASCII characters just fine before it throws the exception on one particular line. The output file must be ASCII, and losing some characters is just fine.

This is the line that throws the error, shown here decoded as UTF-8:

'Zoë\'s Coffee House'

This is the hex encoding:

$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017

Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII? This is with Python 3.2.3 on Kubuntu Linux 12.10.

Did you specify encoding in the header of your python file? Just a quick check. — Dylan Lawrence, Jan 14 '14 at 14:37
@DylanLawrence: That has absolutely nothing to do with the data the code handles. — Ignacio Vazquez-Abrams, Jan 14 '14 at 14:40
@DylanLawrence: 1) This is Python 3, so not necessary. 2) This has to do with reading data, not with the encoding of the Python file itself. — dotancohen, Jan 14 '14 at 14:41
@dotancohen My apologies, it slipped my mind that python 3 does it on its own. — Dylan Lawrence, Jan 14 '14 at 14:43
@DylanLawrence: it's not necessary in Python 2, either. The data isn't in unicode strings in the source. — Wooble, Jan 14 '14 at 14:44

score 17 · Accepted Answer · answered Jan 14 '14 at 14:39

Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII?

Because that's what it explicitly does. That way you can use the same error handler the other way and it will know what to do.

>>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
"'Zo\udcc3\udcab\\'s'"
>>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
b"'Zo\xc3\xab\\'s'"
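Applied to files, the same round trip might look like the sketch below. It assumes the goal is to pass the bytes through unchanged; the helper name `copy_preserving_bytes` and the temporary-directory setup are illustrative and not part of the original post.

```python
import os
import tempfile


def copy_preserving_bytes(src_path, dst_path):
    """Copy a text file line by line, keeping undecodable bytes intact."""
    # The same codec and error handler on both sides: bytes that were
    # escaped to lone surrogates on read are unescaped again on write.
    # newline='' turns off newline translation in both directions.
    with open(src_path, 'r', encoding='ascii',
              errors='surrogateescape', newline='') as src:
        with open(dst_path, 'w', encoding='ascii',
                  errors='surrogateescape', newline='') as dst:
            for line in src:
                dst.write(line)


# The problem line from the question, as raw bytes.
original = b"'Zo\xc3\xab\\'s Coffee House'\n"

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'someFile.txt')
    dst_path = os.path.join(tmp, 'anotherFile.txt')
    with open(src_path, 'wb') as f:
        f.write(original)
    copy_preserving_bytes(src_path, dst_path)
    with open(dst_path, 'rb') as f:
        copied = f.read()

print(copied == original)  # True
```

Note that the output then contains the original non-ASCII bytes, so this suits "pass the data along untouched" rather than "produce pure ASCII".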

score 10 · Answer 2 · answered Sep 12 '14 at 14:58

A lone surrogate should NOT be encoded in UTF-8 -- which is precisely why it was used for the internal representation of invalid input.

In real life, it is pretty common to get data that is invalid for the encoding it is "supposed" to be in. For example, this question was inspired by text that appears to be in Latin-1, when ASCII or UTF-8 was expected. I put "supposed" in quotes, because it is pretty common for the "encoding information" to just be a guess, perhaps unrelated to the actual file.

By default, XML processing (and most Unicode processing) is strict -- the entire process gives up even though it could process hundreds of other lines just fine.

Decoding with errors='replace' would turn that line into "Zo?'s Coffee House", which is an improvement. (Well, unless you tried to replace invalid characters with something else that isn't valid either -- and the official Unicode replacement character isn't valid in ASCII, which is why a '?' is typically used for encoding.)
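A small sketch of the difference between the two directions, using the bytes from the question's hex dump. Since the ë there is stored as two UTF-8 bytes, decoding it as ASCII produces two replacement characters rather than one; the variable names are illustrative.

```python
raw = b"Zo\xc3\xab's Coffee House"  # 'ë' stored as the two bytes c3 ab

# Decoding: every undecodable byte becomes U+FFFD, the official
# Unicode replacement character.
decoded = raw.decode('ascii', errors='replace')
print(ascii(decoded))  # 'Zo\ufffd\ufffd\'s Coffee House'

# Encoding: every unencodable character becomes '?', because U+FFFD
# itself cannot be represented in ASCII.
encoded = "Zo\u00eb's Coffee House".encode('ascii', errors='replace')
print(encoded)  # b"Zo?'s Coffee House"
```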

surrogateescape is used when the programmer decides "You know what? I don't care if the data is garbage. Maybe I have the wrong codec ... so I'll just pass the unknown bytes along as-is." Python does have to store (but avoid interpreting) those bytes internally until they are passed along.

Using unpaired surrogates allows Python to store the invalid bytes without extra escaping. Precisely because unpaired surrogates are invalid, they will never appear in valid input. (And if they occur anyhow, they'll be interpreted as a pair of unrecognized bytes, both of which get preserved for output.)
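To make the storage scheme concrete: each undecodable byte 0xNN ends up as the lone low surrogate U+DCNN. A minimal sketch, using the bytes from the question (variable names are illustrative):

```python
raw = bytes([0x5a, 0x6f, 0xc3, 0xab])  # "Zo" followed by the bytes c3 ab

text = raw.decode('ascii', errors='surrogateescape')

# Collect the code points that fall in the escape range U+DC80..U+DCFF.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
print(escaped)  # ['0xdcc3', '0xdcab']

# Subtracting 0xDC00 recovers the original byte values.
recovered = bytes(ord(ch) - 0xDC00 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)
print(recovered == b"\xc3\xab")  # True
```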

The original poster's problem is that he was trying to write out that internal representation directly, without reversing the mapping first. The internal representation contained code points that are (intentionally) invalid, so the default (strict) error handler refused to encode them.
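A sketch of the options on the output side, assuming the line has already been read with `encoding="ascii", errors="surrogateescape"` as in the question. Which option fits depends on whether the stray bytes should survive; the variable names are illustrative.

```python
raw = b"'Zo\xc3\xab\\'s Coffee House'\n"
line = raw.decode('ascii', errors='surrogateescape')

# What the question ran into: a strict UTF-8 encode rejects the
# lone surrogates that stand in for the undecodable bytes.
try:
    line.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Option 1: reverse the mapping, which restores the original bytes.
restored = line.encode('ascii', errors='surrogateescape')

# Option 2: drop whatever is not ASCII, giving pure ASCII output.
ascii_only = line.encode('ascii', errors='ignore')

print(strict_failed)     # True
print(restored == raw)   # True
print(ascii_only)        # b"'Zo\\'s Coffee House'\n"
```

For the question's stated goal (ASCII output, losing characters is fine), opening the output file with `encoding='ascii', errors='ignore'` or `errors='replace'` would have a similar effect to option 2.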

score -1 · Answer 3 · answered Jan 15 '14 at 12:43

For what reason should a low surrogate such as DCC3 be encoded in UTF-8? This is not allowed and pointless, because a surrogate is NOT a character. Find the high surrogate that belongs to the low surrogate, decode the pair's code point, and then create the proper UTF-8 sequence for that code point.

brighty


Thank you brighty. What is a DCC3? I've tried to search for what it might be, but I see nothing that seems relevant. I don't understand the rest of the answer either, but hopefully I'll be able to make sense of it after I learn what DCC3 is. Thanks. – dotancohen Jan 15 '14 at 12:59
OK, I'll explain it. UTF-16 is a stream of words. A word has 16 bits, hence the name UTF-16. Words in the range U+DC00 to U+DFFF are low surrogates, and words in the range U+D800 to U+DBFF are high surrogates. That's why 0xDCC3 is a low surrogate, understand? – brighty Jan 18 '14 at 11:37
Because of these ranges, the first 6 bits of a surrogate identify it as a surrogate. The remaining 10 bits are half of the encoded code point value of the character. The other half usually resides in the high surrogate. That's why a high and a low surrogate usually appear as a pair. The 10 bits of the high surrogate and the 10 bits of the low surrogate together give 20 bits; the constant 0x10000 is added to that, and then you have the code point, with a maximum of 21 bits (up to 0x10ffff). – brighty Jan 18 '14 at 11:39
That's why a single high or low surrogate is just a container, used to encode a code point higher than 0xffff as a pair, which becomes a DWORD with the encoded code point inside. In UTF-32 you have 32 bits, so no surrogates are needed and so they are not allowed. In UTF-8 you encode code points higher than 127 as 2-, 3- or 4-byte sequences, so surrogates are disallowed in UTF-8 as well. – brighty Jan 18 '14 at 11:39
Conclusion: if you have to place a character with a code point higher than 0xffff in a word stream (and that is what UTF-16 is), you get into trouble, because a word's range stops at 0xffff. Okay!? So the solution is to take 2 words. But how can we tell that the two words are a pair and decode a code point higher than 0xffff? The answer is that the Unicode inventors set aside "reserved" blocks of words, the so-called high and low surrogates, which must appear as a pair within a UTF-16 stream and together encode a code point higher than 0xffff. Hope this helps. – brighty Jan 18 '14 at 11:52
The low surrogate DCC3 is 1101110011000011 in binary, so 110111 identifies it as a low surrogate, and 0011000011 are the 10 payload bits that are part of the character's code point. To get the complete code point, you need the high surrogate's 10 payload bits as well. Remember that surrogates appear as a pair; with just your low surrogate DCC3 you cannot decode the original code point. Think of it this way: in UTF-16 a word is a child that knows the code point's number up to 0xffff; for everything higher than 0xffff you'll see twins, 2 kids that together know the code point's number. – brighty Jan 18 '14 at 12:04
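The pair arithmetic described in these comments can be sketched as follows. The function name `combine_surrogates` and the example pair D83D/DE00 are illustrative additions. Note that in the surrogateescape case from the question, U+DCC3 is not half of such a pair; it simply carries the byte 0xC3.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError('not a high surrogate: %#x' % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError('not a low surrogate: %#x' % low)
    # 10 payload bits from each half, then add the constant 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Payload bits of the low surrogate discussed above.
payload = 0xDCC3 & 0x3FF
print(format(payload, '010b'))  # 0011000011

# A complete pair: D83D DE00 encodes U+1F600.
code_point = combine_surrogates(0xD83D, 0xDE00)
print(hex(code_point))  # 0x1f600
```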
