With Python 2.7 I am reading text as unicode and writing it as utf-16-le. Most characters are written correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write it correctly:

import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)     
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))

But either of the following changes, applied to the code above, makes it work:

  1. unichr(33033) and unichr(33035) work correctly.

  2. Using 'utf-8' encoding (without a BOM, i.e. byte-order mark).

How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?

philshem
  • Please define "incorrectly". What did you expect? What happens instead? – Pavel Anossov Sep 16 '13 at 12:24
  • When unichr(33033) and unichr(33035) are used, the output is the correct Han character. But when I write unichr(33034), trying to write 脊, I get garbled text. – philshem Sep 16 '13 at 12:28
  • What are you using to view the file? What bytes are written, and what bytes did you expect? – Wooble Sep 16 '13 at 12:38
  • I am using notepad++, which correctly views the character when it is pasted in. When writing the character with the above method, the hex values don't match. – philshem Sep 16 '13 at 13:01

3 Answers

You are opening the file in text mode, which means that line-break bytes are translated to the local convention. Unfortunately, the character you are trying to write, u'\u810a', encodes in UTF-16-LE to the bytes 0A 81. The 0A byte is interpreted as a line break (on Windows it is written out as 0D 0A), so the character does not make it to the file correctly.
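To see which characters are affected, one option is to check whether the encoded form contains a 0A byte. The sketch below is only an illustration: the helper name has_newline_byte is made up for this example, and the replace() call merely simulates what a Windows-style text mode would do to the bytes. It is written to run on both Python 2.7 and Python 3.

```python
# UTF-16-LE encoding of U+810A: low byte first, so 0A then 81.
encoded = u'\u810a'.encode('utf-16-le')

# Simulation (assumption): Windows text mode turns every 0A byte into 0D 0A.
translated = encoded.replace(b'\n', b'\r\n')


def has_newline_byte(ch, encoding='utf-16-le'):
    """Hypothetical helper: True if encoding `ch` yields a 0x0A byte."""
    return 0x0A in bytearray(ch.encode(encoding))


# The three neighbouring characters from the question.
candidates = [u'\u8109', u'\u810a', u'\u810b']
flagged = [ch for ch in candidates if has_newline_byte(ch)]

print(repr(encoded), repr(translated), flagged)
```

Only u'\u810a' is flagged among the three, which matches the behaviour described in the question. The simulated output has three bytes instead of two, so it is no longer valid UTF-16.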

Open the file in binary mode instead:

open('temp.txt','wb')
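
As a rough check, the question's code with only the mode changed to 'wb' can be written out and read back as raw bytes. This is a sketch, not the asker's exact script: it writes to a temporary directory (an assumption made so the example does not overwrite anything) and uses a u'' literal instead of unichr so it runs on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Temporary location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# Binary mode: bytes are written exactly as given, no line-break translation.
with open(path, 'wb') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    temp.write(u'\u810a'.encode('utf-16-le'))

# Read the raw bytes back to inspect what actually reached the file.
with open(path, 'rb') as temp:
    raw = temp.read()

print(repr(raw))
```

The file should contain exactly four bytes: the BOM FF FE followed by 0A 81.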
Joni

@Joni's answer identifies the root of the problem. If you use codecs.open instead, the file is always opened in binary mode, even if not specified. The utf16 codec also writes the BOM automatically, using native endianness:

import codecs
with codecs.open('temp.txt','w','utf16') as temp:
    temp.write(u'\u810a')

Hex dump of temp.txt:

FF FE 0A 81

Reference: codecs.open
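
A quick way to confirm this on your own machine is to read the file back, both as raw bytes and through the same codec. This is a minimal sketch under the assumption that a temporary directory is an acceptable place for the file. Because the BOM follows native byte order, the check accepts either the little-endian or the big-endian BOM.

```python
import codecs
import os
import tempfile

# Temporary location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# The utf16 codec writes a BOM before the first character.
with codecs.open(path, 'w', 'utf16') as temp:
    temp.write(u'\u810a')

# Raw bytes: a 2-byte BOM followed by the 2-byte character.
with open(path, 'rb') as f:
    raw = f.read()
has_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with the same codec consumes the BOM and returns the text.
with codecs.open(path, 'r', 'utf16') as f:
    round_trip = f.read()

print(repr(raw), has_bom, repr(round_trip))
```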

Mark Tolonen

You're already using the codecs library. When working with that file, replace open() with codecs.open() to handle the encoding transparently.

import codecs
with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))

If you have a problem after that, you might have an issue with your viewer, not your Python script.

Jordan
  • Looks good, but if I take your code (and add the "as temp" to the open line), I get the following error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128). – philshem Sep 16 '13 at 12:56
  • Thanks for the correction. I noticed that I used underscores instead of dashes for the encoding. Please try again. – Jordan Sep 16 '13 at 13:01
  • You shouldn't be writing the BOM explicitly. Just open the file with encoding='utf-16' and the BOM will be written for you. See this answer for an explanation: http://stackoverflow.com/a/5726295/107660 – Duncan Sep 16 '13 at 14:05
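
The UnicodeDecodeError mentioned in the comments is what Python 2 raises when the byte string codecs.BOM_UTF16_LE is written to a file opened with an encoding: the writer expects unicode and first tries to decode the bytes as ASCII. If little-endian output with a BOM is required regardless of platform, one possible approach (a sketch, not something proposed in the thread) is to keep 'utf-16-le' and write the BOM as the text character u'\ufeff', which that codec encodes to FF FE. The temporary path is an assumption for this example.

```python
import codecs
import os
import tempfile

# Temporary location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

with codecs.open(path, 'w', encoding='utf-16-le') as temp:
    temp.write(u'\ufeff')              # BOM written as text, not as bytes
    temp.write(u'\u8109\u810a\u810b')  # the three characters from the question

# Inspect the raw bytes that reached the file.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(raw))
```

Decoding the result with the generic 'utf-16' codec consumes the BOM and returns the three characters.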