Python: solving unicode hell with unidecode

Question

I have been working on ways to flatten text into ASCII, so that ā becomes a, ñ becomes n, and so on.

unidecode has been fantastic for this.

# -*- coding: utf-8 -*-
from unidecode import unidecode
print(unidecode(u"ā, ī, ū, ś, ñ"))
print(unidecode(u"Estado de São Paulo"))

Produces:

a, i, u, s, n
Estado de Sao Paulo

However, I can't duplicate this result with data from an input file.

Content of test.txt file:

ā, ī, ū, ś, ñ
Estado de São Paulo

# -*- coding: utf-8 -*-
from unidecode import unidecode
with open("test.txt", 'r') as inf:
    for line in inf:
        print unidecode(line.strip())

Produces:

A, A<<, A<<, A, A+-
Estado de SAPSo Paulo

And:

RuntimeWarning: Argument is not an unicode object. Passing an encoded string will likely have unexpected results.

Question: How can I read these lines in as unicode so that I can pass them to unidecode?

Why is this "Unicode hell"? Those are perfectly good accented characters. Hell would be if they were mutilated beyond repair (which some may argue that your solution actually does). — tripleee, Mar 20 '14 at 17:37
I agree. These are tip-top characters, and I feel very guilty for steamrolling them, but that is what I did. Good news is that I'll have time to think about it in ascii purgatory. — e h, Mar 20 '14 at 17:58

Mark Ransom · Accepted Answer · 2020-09-16T14:42:15.243

8

import codecs
with codecs.open("test.txt", 'r', 'utf-8') as inf:

Edit: The above was for Python 2.x. For Python 3 you don't need codecs; the built-in open accepts an encoding parameter.

with open("test.txt", 'r', encoding='utf-8') as inf:
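To make the difference concrete, here is a minimal Python 3 sketch that writes the sample text to a temporary file and reads it back twice: once as UTF-8, and once as Latin-1 to imitate what happens when each byte is treated as its own character. The helper name read_lines and the temporary path are only illustrative, and the sketch assumes the file really is UTF-8.

```python
import os
import tempfile

SAMPLE = "ā, ī, ū, ś, ñ\nEstado de São Paulo\n"


def read_lines(path, encoding="utf-8"):
    """Return the stripped lines of a text file, decoded with `encoding`."""
    with open(path, "r", encoding=encoding) as inf:
        return [line.strip() for line in inf]


# Write the sample to a temporary file as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write(SAMPLE)

good = read_lines(path)            # bytes decoded as UTF-8: text survives intact
bad = read_lines(path, "latin-1")  # every byte becomes a separate character

print(good[1])
print(bad[1])
```

Each element of good is a str, so it can be passed straight to unidecode. The second read turns ã into the two characters Ã and £, which appears consistent with the "SAPSo" in the question's output.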

edited Sep 16 '20 at 14:42

answered Mar 20 '14 at 17:16

Mark Ransom

score 5 · Answer 2 · answered Mar 20 '14 at 17:15


import codecs
with codecs.open('test.txt', encoding='whicheveronethefilewasencodedwith') as f:
    ...

The codecs module provides a function to open files with automatic Unicode encoding/decoding, among other things.
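As a rough illustration of what that automatic decoding saves you from doing by hand, the sketch below reads the file as raw bytes and decodes them explicitly. The helper read_text is hypothetical, and UTF-8 is assumed as the file's real encoding; naming the wrong one either raises an error or silently produces wrong characters.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a file as raw bytes, then decode them with the named encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(encoding)


# Create a small UTF-8 encoded file to read back.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write("Estado de S\u00e3o Paulo\n".encode("utf-8"))

text = read_text(path, "utf-8")  # correct encoding: decodes cleanly

# ASCII cannot represent the bytes for "ã", so decoding fails loudly.
try:
    read_text(path, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(text), ascii_ok)
```

Passing the encoding when opening the file, as in the answers here, performs the same decode step as the data is read.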


user2357112

Thanks. Both answers were perfect, went with Mark since he was first. – e h Mar 20 '14 at 18:56
