In Python, what is the best way to write to a UTF-8 encoded file with platform-dependent newlines? The solution would ideally work quite transparently in a program that does a lot of printing in Python 2. (Information about Python 3 is welcome too!)

In fact, the standard way of writing to a UTF-8 file seems to be codecs.open('name.txt', 'w'). However, the documentation indicates that

(…) no automatic conversion of '\n' is done on reading and writing.

because the file is actually opened in binary mode. So, how can one write to a UTF-8 file with proper platform-dependent newlines?

Note: The 't' mode seems to actually do the job (codecs.open('name.txt', 'wt')) with Python 2.6 on Windows XP, but is this documented and guaranteed to work?

Eric O. Lebigot

3 Answers

Presuming Python 2.7.1 (those are the docs that you quoted): the 'wt' mode is not documented (the ONLY mode documented is 'r'), and it does not work -- the codecs module appends 'b' to the mode, which causes it to fail:

>>> f = codecs.open('bar.txt', 'wt', encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
ValueError: Invalid mode ('wtb')

Avoid the codecs module and DIY:

f = open('bar.txt', 'w')
f.write(unicode_object.encode('utf8'))
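
As a possible alternative to encoding by hand, here is a minimal sketch that assumes Python 2.6 or later, where the `io` module is available. `io.open()` accepts an `encoding` argument and, in text mode, translates each `'\n'` written into `os.linesep`. The helper names `write_text` and `read_raw` and the sample text are made up for illustration. Note that in Python 2 the file object returned by `io.open()` only accepts unicode objects, not byte strings.

```python
import io
import os
import tempfile


def write_text(path, text):
    """Write unicode text as UTF-8, letting text mode translate newlines."""
    # newline defaults to None, so each '\n' written becomes os.linesep.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)  # must be unicode text, not an encoded byte string


def read_raw(path):
    """Return the bytes on disk so the line terminators can be inspected."""
    with io.open(path, 'rb') as f:
        return f.read()


# Hypothetical demo file in a fresh temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'bar.txt')
write_text(demo_path, u'line1\n\xffline2\n')
raw = read_raw(demo_path)
```

On Windows, `raw` should end each line with `\r\n`; on Linux and macOS it keeps `\n`.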

Update about Python 3.x:

It appears that codecs.open() has the same deficiency (it won't write the platform-specific line terminator). However, the built-in open(), which has an encoding arg, is happy to do it:

[Python 3.2 on Windows 7 Pro]
>>> import codecs
>>> f = codecs.open('bar.txt', 'w', encoding='utf8')
>>> f.write('line1\nline2\n')
>>> f.close()
>>> open('bar.txt', 'rb').read()
b'line1\nline2\n'
>>> f = open('bar.txt', 'w', encoding='utf8')
>>> f.write('line1\nline2\n')
12
>>> f.close()
>>> open('bar.txt', 'rb').read()
b'line1\r\nline2\r\n'
>>>
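
The session above only shows the Windows result. As a rough sketch (Python 3 only), the `newline` argument of the built-in `open()` can make the translation visible on any platform. The helper name `written_bytes` is invented for this example.

```python
import os
import tempfile


def written_bytes(text, newline):
    """Write text as UTF-8 with the given newline setting; return raw bytes."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        with open(path, 'w', encoding='utf-8', newline=newline) as f:
            f.write(text)
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)


# newline=None (the default): '\n' is translated to os.linesep.
platform_default = written_bytes('line1\nline2\n', None)
# newline='\r\n': '\n' is always written as CRLF, whatever the platform.
forced_crlf = written_bytes('line1\nline2\n', '\r\n')
# newline='': no translation takes place.
untranslated = written_bytes('line1\nline2\n', '')
```

Leaving `newline` at its default is what gives platform-dependent terminators. The other two values are shown only for comparison.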

Update about Python 2.6

The docs say the same as the 2.7 docs. The difference is that the "bludgeon into binary mode" hack of appending "b" to the mode arg failed in 2.6: "wtb" wasn't detected as an invalid mode, so the file was opened in text mode. It appears to work as you wanted, not as documented:

>>> import codecs
>>> f = codecs.open('fubar.txt', 'wt', encoding='utf8')
>>> f.write(u'\u0a0aline1\n\xffline2\n')
>>> f.close()
>>> open('fubar.txt', 'rb').read()
'\xe0\xa8\x8aline1\r\n\xc3\xbfline2\r\n' # "works"
>>> f.mode
'wtb' # oops
>>>
John Machin
  • @John: Thanks. The Python 3 approach is great. As for the Python 2 approach, it is a real pain for programs that contain a lot of output. It is strange that codecs.open() works with 'wt' with my Python 2.6 on Windows… – Eric O. Lebigot May 10 '11 at 09:34
  • @EOL: See my latest update re 2.6. Perhaps you could avoid "real pain" by funneling all output through a wrapper ... – John Machin May 10 '11 at 11:19
  • @John: It would be nice to mention `f = codecs.getwriter("utf-8")(f)` to your solution for Python 2, as the question is for "a program that does a lot of printing", so that automatic encoding upon printing is useful. – Eric O. Lebigot May 17 '11 at 20:08
  • @John: do you see any reason to use `codecs.open()` instead of `open()`, in Python 3 (except for backward compatibility reasons)? – Eric O. Lebigot May 17 '11 at 20:13
  • @EOL: Answers to both questions: codecs module is stuffed (won't do platform-dependent line termination) so (1) why "nice to mention"?? (2) useless backwards compatible with useless – John Machin May 17 '11 at 21:52
  • @John: Regarding (1), one can do your `f = open('bat.txt', 'w')`, but followed by `f = codecs.getwriter('utf-8')(f)`; this yields both a correct encoding and proper line terminations, while allowing users to directly print (unicode) strings to the file. This is convenient in programs that "do a lot of printing". – Eric O. Lebigot May 18 '11 at 12:41
  • @John: There is a small problem with encoding before writing to a text file: if the encoding creates `\n` bytes as part of the encoding of some characters, then the text file is incorrectly encoded. This is not a problem for UTF-8, but can be a problem for other encodings (maybe some CJK encodings?). So, Python 3 seems to offer the only simple and robust solution to the general question asked originally, no? – Eric O. Lebigot May 18 '11 at 13:12
  • @EOL: All sensibly-designed encodings (includes all CJK encodings that I'm aware of) avoid multi-byte sequences that contain 0x00 to 0x1f or 0x7f bytes. That leaves only UTF-16 and UTF-32 as problematic – John Machin Jun 01 '11 at 21:43
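
The point made in the last two comments can be checked with a small sketch. The sample string reuses the characters from the 2.6 session in the answer, and the helper name `contains_newline_byte` is invented here. The idea is that encoding first and then writing to a text-mode file is only safe when the encoded bytes never contain 0x0a except for real newlines.

```python
def contains_newline_byte(text, encoding):
    """Report whether encoding `text` produces a 0x0a byte."""
    return b'\n' in text.encode(encoding)


# No newline characters in the text itself.
sample = u'\u0a0a\xff'

# UTF-8 multi-byte sequences only use bytes >= 0x80.
utf8_hit = contains_newline_byte(sample, 'utf-8')
# U+0A0A is the two bytes 0x0a 0x0a in UTF-16, so newline translation
# in text mode would corrupt it.
utf16_hit = contains_newline_byte(sample, 'utf-16-le')
utf32_hit = contains_newline_byte(sample, 'utf-32-le')
```

Here `utf8_hit` is false while `utf16_hit` and `utf32_hit` are true, which is consistent with UTF-16 and UTF-32 being the problematic cases.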

Are you looking for os.linesep? http://www.python.org/doc//current/library/os.html#os.linesep
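
One way this could look, as a hedged sketch: the Python documentation advises against using os.linesep as a line terminator in files opened in text mode, so the example below opens the file in binary mode and does both the newline replacement and the UTF-8 encoding by hand. The function name `write_with_linesep` and the file name are hypothetical.

```python
import os
import tempfile


def write_with_linesep(path, text):
    """Replace '\n' with os.linesep, encode as UTF-8, write in binary mode."""
    data = text.replace(u'\n', os.linesep).encode('utf-8')
    # Binary mode: nothing is translated a second time.
    with open(path, 'wb') as f:
        f.write(data)


# Hypothetical demo file in a fresh temporary directory.
linesep_path = os.path.join(tempfile.mkdtemp(), 'name.txt')
write_with_linesep(linesep_path, u'caf\xe9\nline2\n')

with open(linesep_path, 'rb') as f:
    linesep_bytes = f.read()
```

This means every write has to go through the helper, which is less transparent than a file object that translates newlines itself.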

thule

In Python 2, why not encode explicitly?

with open('myfile.txt', 'w') as f:
    print >> f, some_unicode_text.encode('UTF-8')

Both embedded newlines, and those emitted by print, will be converted to the appropriate platform newline.
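
For a program that prints a lot, a Python 3 counterpart might be sketched as follows, assuming Python 3.4 or later for `contextlib.redirect_stdout`. The names `run_with_output_file` and `report` are invented for illustration. Existing `print()` calls stay unchanged, and the built-in `open()` handles both the encoding and the newline translation.

```python
import contextlib
import os
import tempfile


def run_with_output_file(path, func):
    """Call func() with everything it prints going to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as f:
        with contextlib.redirect_stdout(f):
            func()


def report():
    # Stand-in for existing code that prints a lot.
    print('caf\xe9')
    print('line2')


# Hypothetical demo file in a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
run_with_output_file(out_path, report)

with open(out_path, 'rb') as f:
    captured = f.read()
```

Each `print()` ends its line with `'\n'`, which the text-mode file turns into the platform's line terminator.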

Marius Gedminas
  • A lot of printing is done in the program, so I would love to see a lightweight solution. Also, I'm wondering whether it is guaranteed that text mode does not truncate to 7 bits on some platforms… – Eric O. Lebigot May 10 '11 at 13:05