You can use non-ASCII characters in Python, but you have to declare the encoding of your source file with a #coding comment, and save the source in the encoding you declared. It is also good practice to do all text processing in Unicode:
#!python2
#coding:utf8
line = u'This is a ʃɐ͂ẽ test'
line = line.replace(u'ʃ', u' sh ')
line = line.replace(u'ɐ͂', u' an ')
line = line.replace(u'ẽ', u' en ')
print line
Output:
This is a  sh  an  en  test
Note that ɐ͂ is actually two Unicode codepoints: ɐ (U+0250) followed by the combining codepoint U+0342 COMBINING GREEK PERISPOMENI. The ẽ can be represented either as the single codepoint U+1EBD LATIN SMALL LETTER E WITH TILDE, or as the two codepoints U+0065 LATIN SMALL LETTER E and U+0303 COMBINING TILDE. To make sure you are working with either all composed or all decomposed codepoints, use the unicodedata module:
import unicodedata as ud
line = ud.normalize('NFC', line)  # all composed codepoints.
line = ud.normalize('NFD', line)  # all decomposed codepoints.
There is also NFKD and NFKC. See the Unicode standard for details on which is best for you.
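As a small check (a sketch that runs under Python 2 or 3), the composed and decomposed spellings of ẽ compare unequal as raw strings, but compare equal once both are normalized to the same form:

```python
# -*- coding: utf-8 -*-
import unicodedata as ud

composed = u'\u1ebd'      # ẽ as the single codepoint U+1EBD
decomposed = u'e\u0303'   # U+0065 followed by U+0303 COMBINING TILDE

print(composed == decomposed)                        # False: different codepoint sequences
print(ud.normalize('NFC', decomposed) == composed)   # True: NFC composes e + tilde
print(ud.normalize('NFD', composed) == decomposed)   # True: NFD decomposes U+1EBD
```

This is also why comparing or replacing user-supplied text without normalizing it first can silently fail.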
If you are reading from a file, use io.open and specify the encoding of the file to automatically decode the input to Unicode:

import io

with io.open('data.txt', 'r', encoding='utf8') as f:
    for line in f:
        # do something with the Unicode line.
        pass
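A round-trip sketch of the same idea (the file path and sample text here are made up for the demonstration): writing Unicode through io.open with an encoding, then reading it back, yields Unicode strings on both Python 2 and 3:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical path for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

# io.open encodes Unicode text on the way out...
with io.open(path, 'w', encoding='utf8') as f:
    f.write(u'This is a \u0283\u0250\u0303\u1ebd test\n')

# ...and decodes it back to Unicode on the way in.
with io.open(path, 'r', encoding='utf8') as f:
    lines = list(f)

print(repr(lines[0]))
```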