
In Python 2.7 running on Ubuntu, this code:

f = open("testfile.txt", "w")
f.write("Line one".encode("utf-16"))
f.write(u"\r\n".encode("utf-16"))
f.write("Line two".encode("utf-16"))

produces the desired newline between the two lines of text when read in Gedit:

Line one
Line two

However, the same code executed on Windows 7 and read in Notepad produces unintelligible characters after "Line one", and Notepad doesn't recognize any newline. How can I write correct newline characters for UTF-16 on Windows so the output matches what I get on Ubuntu?

I am writing output for a Windows-only application that only reads Unicode UTF-16. I've spent hours trying out different tips, but nothing seems to work for Notepad. It's worth mentioning that I can successfully convert a text file to UTF-16 right in Notepad, but I'd rather have the script save the encoding correctly in the first place.

I_Ridanovic
    As a side note: `"Line one".encode("utf-16")` is kind of a silly thing to do. Normally, you only want to call `encode` on Unicode strings, so you'd do `u"Line one".encode("utf-16")`. (3.x enforces that; 2.7 doesn't.) I'm guessing in this case it's just an artifact of your toy example, not your real code. – abarnert Jun 18 '13 at 01:31
    PS, in the future, instead of just saying "unintelligible characters", it's worth actually pasting them here. Seeing line 2 show up as `䰀椀渀攀 琀眀漀` would verify that you've got the off-by-one-byte problem but not the BOM problem, while `＀䳾椀渀攀 琀眀漀` would show that you have both problems, etc. – abarnert Jun 18 '13 at 01:51
    Also, it might not be completely unintelligible. Someone who knows Chinese fluently and has a good imagination could probably interpret it as something poetic about Ben climbing into soup bowls from the Han, Ming, and Qing dynasties, and explain how that's a loose figurative translation of "Line two". :) – abarnert Jun 18 '13 at 01:58

1 Answer


The problem is that you're opening the file in text mode, but trying to use it as a binary file.

This:

u"\r\n".encode("utf-16")

… encodes to '\r\0\n\0' (plus a two-byte BOM on the front, which we'll get back to below).

Then this:

f.write('\r\0\n\0')

… converts the Unix newline to a Windows newline, giving '\r\0\r\n\0'.

And that, of course, breaks your UTF-16 encoding. Besides the fact that the \r\n pair of bytes will decode into the valid but unassigned codepoint U+0A0D, that's an odd number of bytes, meaning you've got a leftover \0. So, instead of L\0 being the next character, it's \0L, aka 䰀 (U+4C00), and so on.
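To make the damage concrete, here is a minimal sketch (Python 2.7) of what ends up in the file and how a UTF-16 reader then sees it. It uses utf-16-le throughout to keep the bytes easy to follow (ignoring the BOM issue for the moment), and the .replace() call stands in for the \n-to-\r\n translation Windows performs on a text-mode file:

# What the text-mode write effectively puts in the file, simulated with .replace()
encoded = (u"Line one".encode("utf-16-le") +
           u"\r\n".encode("utf-16-le") +
           u"Line two".encode("utf-16-le"))
mangled = encoded.replace("\n", "\r\n")    # the \n -> \r\n translation
print repr(mangled.decode("utf-16-le", "replace"))
# u'Line one\r\u0a0d\u4c00\u6900\u6e00\u6500\u2000\u7400\u7700\u6f00\ufffd'
# "Line one", a bare \r, then every later character shifted by one byte --
# that \u4c00... run is the same CJK-looking garbage shown in the comments above.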

On top of that, you're writing a new UTF-16 BOM for each encoded string, because every standalone encode("utf-16") call emits one. Most Windows apps will transparently handle that and ignore the extra BOMs, so in practice you're just wasting two bytes per line, but it isn't actually correct.
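You can see the repeated BOMs directly from the reprs; each separate .encode("utf-16") call prepends its own '\xff\xfe' (this sketch assumes a little-endian machine, which is what you'll have on typical x86 hardware):

print repr(u"Line one".encode("utf-16"))   # '\xff\xfeL\x00i\x00n\x00e\x00 \x00o\x00n\x00e\x00'
print repr(u"\r\n".encode("utf-16"))       # '\xff\xfe\r\x00\n\x00' -- another BOM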


The quick fix to the first problem is to open the file in binary mode:

f = open("testfile.txt", "wb")

This doesn't fix the multiple-BOM problem, but it does fix the broken \n problem. If you want to fix the BOM problem as well, you can either use a stateful (incremental) encoder, or explicitly specify 'utf-16-le' (or 'utf-16-be') for every write after the first.
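For example, here is a minimal sketch of a slight variant of that quick fix: write the BOM yourself once, up front, and then use 'utf-16-le' for every write, so the byte order never depends on the machine doing the writing:

import codecs

f = open("testfile.txt", "wb")              # binary mode: no \n translation
f.write(codecs.BOM_UTF16_LE)                # exactly one BOM, up front
f.write(u"Line one".encode("utf-16-le"))    # 'utf-16-le' never adds a BOM
f.write(u"\r\n".encode("utf-16-le"))
f.write(u"Line two".encode("utf-16-le"))
f.close()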


But the easy fix, for both problems, is to use the io module (or, for older Python 2.x, the codecs module) to do all the hard work for you:

import io

f = io.open("testfile.txt", "w", encoding="utf-16", newline="")
f.write(u"Line one")
f.write(u"\r\n")
f.write(u"Line two")
f.close()

(newline="" turns off \n translation, so the explicit \r\n is written as-is; the utf-16 codec used through io.open writes a single BOM at the start and plain UTF-16 for everything after it. Note that the strings must be unicode, or io will raise a TypeError.)
abarnert
  • Thanks for the detailed explanation. I ended up using the solution with codecs module. – I_Ridanovic Jun 18 '13 at 05:59
  • @I_Ridanovic: Unless you need pre-2.6 support, `io.open` is generally better than `codecs.open`. See [issue 8796](http://bugs.python.org/issue8796) for some of the ways. (Also, if you're using `codecs.open` on Windows and already having newline problems, see [issue 7262](http://bugs.python.org/issue7262) for how the behavior differs from the docs.) – abarnert Jun 18 '13 at 18:14