
I'm trying to write unicode strings to a file in Python but when I read the file using linux "cat" or "less" the correct characters are not written, instead they show up as garbage.

I am reading the object from an Oracle database. When I print the type (where a is a row in the database results):

logger.debug(type(a[index])) 

it outputs:

<type 'unicode'>

I open the file for writing like so:

ff = codecs.open(filename, mode='w', encoding='utf-8')

and I write the line to the file like:

ff.write(a[index])

but when I read the output file, it doesn't show the correctly accented characters but garbage instead:

$Bu��rger, Udo, -1985. Way to perfect horsemanship

How do I correctly write unicode string objects to a file in Python?
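Here is a minimal, self-contained version of what I'm doing (the filename and the row value are stand-ins; the real value comes from the Oracle query via cx_Oracle):

```python
# -*- coding: utf-8 -*-
import codecs

# Stand-in for a[index]; the real data comes from the database.
row_value = u'B\u00fcrger, Udo, -1985. Way to perfect horsemanship'

filename = 'titles.txt'
ff = codecs.open(filename, mode='w', encoding='utf-8')
ff.write(row_value)
ff.close()
```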

    How are you reading the file? Does whatever software you're using to read it know that it should be reading it as UTF-8? What is the encoding of your linux terminal? – BrenBarn May 24 '14 at 21:41
  • What does `logger.debug(repr(a[index]))` write? You appear to have a mighty big [Mojibake](http://en.wikipedia.org/wiki/Mojibake) there. – Martijn Pieters May 24 '14 at 21:44
  • 1
    If I encode that text to Latin-1 then decode again as UTF-8 I get `u'$Bu\ufffd\ufffdrger, Udo, -1985. Way to perfect horsemanship\n'`. Not quite legible but does indicate you used `error='replace'` here when you decoded something with the *wrong* encoding. U+FFFD is the default replacement character when you use that error handling. – Martijn Pieters May 24 '14 at 21:45
  • Sounds like `a[index]` contains two instances of the unicode replacement character (http://www.fileformat.info/info/unicode/char/0fffd/index.htm), which is getting correctly encoded as utf-8. However, your terminal is reading using latin-1 instead of utf-8. The terminal problem should be easy to fix, figuring out why you have replacement characters I would expect to be harder. Unless that's what you expect to be there, I guess. – Peter DeGlopper May 24 '14 at 21:51
  • echo $LANG reports en_US.UTF-8. I only used the terminal output as an example though. The problem still shows up when I subsequently open and read the file using `file = codecs.open(filename, 'r', encoding='utf-8')` and `file.readline()`. I output what's returned from readline using Django's render_to_response and in the html output garbage shows up. – user45183 May 24 '14 at 21:54
  • Can you show us the `repr()` of the data as you write it to the file? You'll have to do this at various points in your code, really; where you receive the data from Oracle for example. By the time you are writing it to the file it is almost certainly corrupted. – Martijn Pieters May 24 '14 at 23:25

1 Answer


I can guess at how you arrived at that Mojibake of a string. It is quite involved; I am impressed at how mucked up this got to be.

Something decoded the text from bytes to Unicode with errors='replace', masking the fact that the wrong codec was used: bytes that weren't recognized were silently swapped for replacement characters.

The resulting Unicode text, now containing U+FFFD REPLACEMENT CHARACTER codepoints, was then encoded to UTF-8 but decoded again as Latin-1, most likely by your terminal as cat or less output the raw bytes.

Undoing that last step (encode as Latin-1, decode as UTF-8) recovers the intermediate text with its replacement characters:

>>> print u'$Buï¿½ï¿½rger, Udo, -1985. Way to perfect horsemanship'.encode('latin1').decode('utf8')
$Bu��rger, Udo, -1985. Way to perfect horsemanship

Presumably this was meant to be Bürger, Udo, - 1985. Way to perfect horsemanship, with the ü being formed by the character u and the U+0308 COMBINING DIAERESIS codepoint, which would have been CC 88 in UTF-8, but not decodable as ASCII:

>>> text = u'Bu\u0308rger, Udo, - 1985. Way to perfect horsemanship'
>>> print text
Bürger, Udo, - 1985. Way to perfect horsemanship
>>> text.encode('utf8')
'Bu\xcc\x88rger, Udo, - 1985. Way to perfect horsemanship'
>>> text.encode('utf8').decode('ascii', errors='replace')
u'Bu\ufffd\ufffdrger, Udo, - 1985. Way to perfect horsemanship'

The moral of the story: Don't use errors='replace' unless you are absolutely sure what you are doing.
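A small sketch of why: with the wrong codec, the default strict error handling fails loudly, while errors='replace' silently swaps the undecodable bytes for U+FFFD:

```python
# UTF-8 bytes for u'Bu\u0308rger' (u followed by COMBINING DIAERESIS).
utf8_bytes = u'Bu\u0308rger'.encode('utf-8')

# Wrong codec with errors='replace': the corruption passes silently;
# the two bytes CC 88 become two U+FFFD replacement characters.
masked = utf8_bytes.decode('ascii', errors='replace')
print(repr(masked))

# Wrong codec with the default (strict) handling: fails loudly,
# pointing straight at the mistake.
try:
    utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print('strict decode raises: %s' % exc)
```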

Martijn Pieters
  • Hi Martijn, thanks for the response. This is what repr outputs just after I receive the data from Oracle (before I operate on it at all): `u'Bu\ufffd\ufffdrger, Udo, -1985....'`. I haven't used `errors='replace'` anywhere in my code. And I'm pretty sure when I saw the DBA run the query while using the Oracle client it displayed correctly so it is in the database correctly. I am using the cx_Oracle driver and Django's raw sql methods to query the database. I am wondering if cx_Oracle or Django is responsible for mucking it up somehow and inserting the U+FFFD replacement character. Thanks. – user45183 May 25 '14 at 05:58
  • Right. What are the versions of Django, Oracle, the Oracle Client libraries, and cx_Oracle? What, if anything, is the NLS_LANG environment variable set to? If you import `cx_Oracle` into a python session, does `cx_Oracle.UNICODE` exist? – Martijn Pieters May 25 '14 at 09:21
  • Django is 1.6.4 and cx_Oracle is 5.1.2. Our DBA reports that Instant Client is 11.2.0.4 and Oracle on the server is 10.2.0.4 and that NLS_LANG is set to AMERICAN.AMERICA.US7ASCII. When I import cx_Oracle, yes, cx_Oracle.UNICODE exists. So is the problem that NLS_LANG is not set to American_America.UTF8 on the server and client? Thanks. – user45183 May 25 '14 at 17:13
  • @user45183: Yes, the problem is `NLS_LANG` here; set it to something matching the UTF-8 data in the database. – Martijn Pieters May 25 '14 at 17:31
  • @user45183: Note that on the client side, [Django already sets `NLS_LANG` to `'.UTF8'`](https://github.com/django/django/blob/master/django/db/backends/oracle/base.py#L34) to force the issue. – Martijn Pieters May 25 '14 at 21:41
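The fix from this comment thread can be sketched as follows (the import and connect call are left commented out because they require an Oracle client; the DSN is hypothetical):

```python
import os

# NLS_LANG must be in the environment before the Oracle client
# libraries are loaded, i.e. before importing cx_Oracle. '.UTF8'
# matches the UTF-8 data in the database; Django sets this same
# value on the client side to force the issue.
os.environ['NLS_LANG'] = '.UTF8'

# import cx_Oracle
# conn = cx_Oracle.connect('user/password@dbhost/service')  # hypothetical DSN
```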