On the subject of reading and writing text files in Python, one of the main Python contributors says this about the surrogateescape Unicode error handler:
[surrogateescape] handles decoding errors by squirreling the data away in a little used part of the Unicode code point space. When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly.
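As I understand that description, the round trip can be sketched as follows. This is a minimal illustration rather than code from my script; the byte string `raw` is a made-up sample (the UTF-8 encoding of "Zoë") and the variable names are only for illustration.

```python
# Hypothetical sample: the UTF-8 bytes for "Zoë", which are not valid ASCII.
raw = b"Zo\xc3\xab"

# Decoding as ASCII with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate code point U+DCNN instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))  # 'Zo\udcc3\udcab'

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the round trip is lossless.
restored = text.encode("ascii", errors="surrogateescape")
print(restored == raw)  # True
```

Note that the round trip only works here because the encoding step also specifies `errors="surrogateescape"`.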
However, opening a file with this handler and then writing its lines to another file:
input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')
for line in input_file:
    output_file.write(line)
Results in:
  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed
Note that the input file is not ASCII. However, the script gets through hundreds of lines that contain non-ASCII characters just fine before it throws the exception on one particular line. The output file must be ASCII, and losing some characters is fine.
This is the line that triggers the error, shown here decoded as UTF-8:
'Zoë\'s Coffee House'
This is a hex dump of that line:
$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017
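The failure seems to be reproducible from those bytes alone, without any files. The sketch below copies the bytes from the hex dump into a literal and assumes, as the traceback suggests, that the output file's default encoding is UTF-8; the names `raw`, `line`, and `error` are only for illustration.

```python
# Bytes copied from the hex dump above; c3 ab is the UTF-8 encoding of "ë".
raw = b"'Zo\xc3\xab\\'s Coffee House'\n"

# Same decoding the input file uses: ASCII with surrogateescape.
line = raw.decode("ascii", errors="surrogateescape")
print(ascii(line[3]))  # '\udcc3'

# Assumption: the output file encodes as UTF-8 with the default strict
# error handler, which is what the traceback reports.
error = None
try:
    line.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The character at index 3 is the escaped byte 0xc3, which matches the "position 3" in the traceback.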
Why might the surrogateescape Unicode error handler be returning a character that is not ASCII? This is with Python 3.2.3 on Kubuntu Linux 12.10.