
I have a file (downloaded from somewhere on the web) that is encoded in CP819, and I want to read it and then handle the data further as UTF-8. I tried all the examples I could find here and elsewhere; nothing worked.

The furthest I could get:

import codecs

with codecs.open(INFIL, mode='rb', encoding='cp819') as INPUT:
    DUMMY = INPUT.readline()
    print(DUMMY)

which gave me

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 5: ordinal not in range(128)

At offset 5 in the input file is the first byte outside the ASCII range: \xe8, which is supposed to decode to 'è'.

I found a few pages about this error message and tried all the suggestions there, but nothing helped.

Using Python 2.7.6 on Ubuntu 14.04.1 LTS.

Karlchen9
1 Answer


You can explicitly encode the unicode string using unicode.encode:

import codecs

with codecs.open(INFIL, encoding='cp819') as f:
    line = f.readline()
    print line.encode('utf-8')
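Why the traceback mentions the ascii codec at all: the decode from CP819 succeeds, and the error is only raised when print implicitly encodes the resulting unicode string for a non-UTF-8 stdout. A minimal sketch (the byte string is made up to mirror the question's file; runs on Python 2 or 3):

```python
# CP819 is an alias for Latin-1 in Python's codec registry, so the
# decode itself works fine; it is the implicit ASCII encode that fails.
sample = b'caffe\xe8'                  # raw CP819 bytes, \xe8 at offset 5
text = sample.decode('cp819')          # succeeds: u'caffe\xe8' ('caffeè')
try:
    text.encode('ascii')               # what print effectively attempted
except UnicodeEncodeError as exc:
    print('same failure as the question: %s' % exc)
```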

Another way is to invoke the Python program with the environment variable PYTHONIOENCODING=utf-8 set, so that print encodes its output as UTF-8 automatically:

PYTHONIOENCODING=utf-8 python /path/to/python_program.py
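To check what that variable does, you can inspect sys.stdout.encoding in a child process. A sketch (assumption: Python 3.7+ for subprocess.run with capture_output):

```python
# PYTHONIOENCODING overrides the encoding Python picks for stdout,
# which is what the print call uses for implicit encoding.
import os
import subprocess
import sys

env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env, capture_output=True, text=True,
).stdout
print(out.strip())  # reports a UTF-8 encoding regardless of the terminal
```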
falsetru
  • Well, it was a good answer to my first issue, but I immediately came upon a new one. Not sure whether to continue here or to start a new question - the issue is still related to decoding an input file. – Karlchen9 Sep 21 '14 at 08:41
  • @Karlchen9, Please post a separate question. – falsetru Sep 21 '14 at 08:42
  • @Karlchen9, posting a question as an answer is not welcome in the Stack Overflow community. – falsetru Sep 21 '14 at 08:42
  • @Karlchen9, To answer your new question briefly: you should not use `codecs.open` if you are dealing with binary data; read the raw bytes instead and decode the text lines manually. – falsetru Sep 21 '14 at 08:45
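A sketch of that last suggestion, assuming the file is opened in binary mode and only the text lines are decoded by hand (io.BytesIO stands in for a real open('input.txt', 'rb'); runs on Python 2 or 3):

```python
# Read raw bytes, decode each text line from CP819 manually, and
# re-encode as UTF-8 for further handling.
import io

raw_file = io.BytesIO(b'caffe\xe8\nsecond line\n')
for raw in raw_file:                            # iterate over raw byte lines
    text = raw.rstrip(b'\n').decode('cp819')    # bytes -> unicode
    utf8 = text.encode('utf-8')                 # unicode -> UTF-8 bytes
    print(repr(utf8))
```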