I have been working on ways to flatten text into ASCII, so that ā -> a, ñ -> n, etc.

unidecode has been fantastic for this.

# -*- coding: utf-8 -*-
from unidecode import unidecode
print(unidecode(u"ā, ī, ū, ś, ñ"))
print(unidecode(u"Estado de São Paulo"))

Produces:

a, i, u, s, n
Estado de Sao Paulo

However, I can't duplicate this result with data from an input file.

Content of test.txt file:

ā, ī, ū, ś, ñ
Estado de São Paulo

# -*- coding: utf-8 -*-
from unidecode import unidecode
with open("test.txt", 'r') as inf:
    for line in inf:
        print unidecode(line.strip())

Produces:

A, A<<, A<<, A, A+-
Estado de SAPSo Paulo

And:

RuntimeWarning: Argument is not an unicode object. Passing an encoded string will likely have unexpected results.

Question: How can I read these lines in as unicode so that I can pass them to unidecode?

e h
Why is this "Unicode hell"? Those are perfectly good accented characters. Hell would be if they were mutilated beyond repair (which some may argue that your solution actually does). – tripleee Mar 20 '14 at 17:37

I agree. These are tip-top characters, and I feel very guilty for steamrolling them, but that is what I did. Good news is that I'll have time to think about it in ascii purgatory. – e h Mar 20 '14 at 17:58

2 Answers

Use codecs.open:

import codecs
with codecs.open("test.txt", 'r', 'utf-8') as inf:

Edit: The above was for Python 2.x. In Python 3 you don't need codecs; the built-in open accepts an encoding parameter.

with open("test.txt", 'r', encoding='utf-8') as inf:
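
As a minimal end-to-end sketch of the approach above (not part of the original answer), the snippet below writes the sample data to a temporary file and reads it back as decoded text. It uses io.open, which takes an encoding argument on both Python 2.6+ and Python 3. The helper name read_text_lines and the temporary file are hypothetical, chosen only for illustration, and unidecode is assumed to be installed.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text_lines(path, encoding="utf-8"):
    """Return the file's lines as decoded text, stripped of whitespace."""
    # io.open decodes while reading; on Python 3 it is the built-in open.
    with io.open(path, "r", encoding=encoding) as inf:
        return [line.strip() for line in inf]


# Demo setup: store the sample data as UTF-8 bytes in a temporary file.
sample = u"ā, ī, ū, ś, ñ\nEstado de São Paulo\n"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-8"))

lines = read_text_lines(demo_path)
os.remove(demo_path)

try:
    from unidecode import unidecode  # third-party; assumed to be installed
except ImportError:
    unidecode = None

if unidecode is not None:
    for line in lines:
        # Each line is already text, so unidecode gets the input it expects.
        print(unidecode(line))
```

Because every line is decoded before it reaches unidecode, the RuntimeWarning about passing an encoded string should no longer appear.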
Mark Ransom
import codecs
with codecs.open('test.txt', encoding='whicheveronethefilewasencodedwith') as f:
    ...

The codecs module provides a function to open files with automatic Unicode encoding/decoding, among other things.
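
To illustrate why the encoding name has to match the file, here is a small sketch (an illustration, not the answerer's code) that decodes the same UTF-8 bytes twice: once with the right codec and once with latin-1, which is a hypothetical mismatch picked for the demo. The mismatched result is the kind of text that unidecode would turn into something like the SAPSo seen in the question.

```python
# -*- coding: utf-8 -*-
# The bytes as they would be stored in a UTF-8 encoded file.
raw = u"São Paulo".encode("utf-8")

right = raw.decode("utf-8")    # codec matches the file's encoding
wrong = raw.decode("latin-1")  # hypothetical mismatch, for illustration only

# "ã" takes two bytes in UTF-8, so the wrong codec yields two characters for it.
print(len(raw), len(right), len(wrong))
print(right == u"São Paulo", wrong == u"S\xc3\xa3o Paulo")
```

Decoding with the wrong codec does not raise an error here; it silently produces different characters, so the encoding passed to codecs.open should be the one the file was actually written with.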

user2357112