I want to do a simple replace like:

line= line.replace ('ʃ',' sh ')
line= line.replace ('ɐ͂',' an ')
line= line.replace ('ẽ',' en ')

The problem is that Python does not accept these characters.

I also tried things like:

line= line.replace (u'\u0283',' sh ')

but I still can't open anything because I get a decoding error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

I messed around with codecs but I couldn't find anything suitable, maybe I am going down the wrong path?

badner

1 Answer

You can use non-ASCII characters in Python, but you have to tell Python the encoding of your source file with a #coding statement. Make sure to save the source in the encoding declared. It is also good practice to do all text processing in Unicode:

#!python2
#coding:utf8
line = u'This is a ʃɐ͂ẽ test'
line = line.replace (u'ʃ',u' sh ')
line = line.replace (u'ɐ͂',u' an ')
line = line.replace (u'ẽ',u' en ')
print line

Output:

This is a  sh  an  en  test
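
The UnicodeDecodeError in the question is the kind of error Python raises when UTF-8 bytes are decoded with the ASCII codec. The following is a minimal sketch of the "decode early, process as Unicode, encode late" pattern, assuming the input bytes are UTF-8; the variable names are illustrative only:

```python
# -*- coding: utf-8 -*-
# Assumption: the raw input is UTF-8 encoded.
raw = b'\xca\x83 test'   # UTF-8 bytes for U+0283 (ʃ) followed by " test"

# Decoding with the ASCII codec fails on the first non-ASCII byte.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decode once at the boundary, work on Unicode text, encode on the way out.
text = raw.decode('utf8')
text = text.replace(u'\u0283', u' sh ')
encoded = text.encode('utf8')

print(repr(ascii_error))
print(repr(text))
```

Keeping every intermediate value as Unicode text avoids the implicit ASCII conversions that Python 2 performs when byte strings and Unicode strings are mixed.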

Note that ɐ͂ is actually two Unicode codepoints: ɐ (U+0250 LATIN SMALL LETTER TURNED A) and the combining codepoint U+0342 COMBINING GREEK PERISPOMENI. The ẽ can be represented either as a single codepoint, U+1EBD LATIN SMALL LETTER E WITH TILDE, or as two codepoints, U+0065 LATIN SMALL LETTER E and U+0303 COMBINING TILDE. To make sure you are consistently using either single combined codepoints or decomposed characters, the unicodedata module can be used:

import unicodedata as ud
line = ud.normalize('NFC',line) # combined.
line = ud.normalize('NFD',line) # decomposed.

There are also NFKD and NFKC. See the Unicode standard for details on which is best for you.
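
To see why normalization matters for replace, here is a small sketch that lists the codepoints in each form of ẽ and compares them before and after normalizing. The helper name describe is hypothetical, not part of any library:

```python
# -*- coding: utf-8 -*-
import unicodedata as ud

composed = u'\u1ebd'      # single codepoint: LATIN SMALL LETTER E WITH TILDE
decomposed = u'e\u0303'   # two codepoints: e + COMBINING TILDE

def describe(text):
    """Return one 'U+XXXX NAME' string per codepoint (hypothetical helper)."""
    return [u'U+%04X %s' % (ord(ch), ud.name(ch, u'<unnamed>')) for ch in text]

print(describe(composed))
print(describe(decomposed))

# The two forms look identical on screen but compare unequal,
# so a replace written with one form will not match the other.
print(composed == decomposed)

# After normalizing both sides to the same form they compare equal.
print(ud.normalize('NFC', decomposed) == composed)
print(ud.normalize('NFD', composed) == decomposed)
```

If a replace seems to skip some characters, printing the codepoints of both the search string and the input this way usually shows whether they are in different forms.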

If you are reading from a file, use io.open and specify the encoding of the file to automatically convert the input to Unicode:

import io

with io.open('data.txt', 'r', encoding='utf8') as f:
    for line in f:
        # line is already a Unicode string here; do something with it.
        line = line.replace(u'\u0283', u' sh ')
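
Putting the pieces together, the following is one possible sketch of reading a UTF-8 file, normalizing each line to NFC, and then applying the replacements. The REPLACEMENTS table, the transliterate function, and the temporary sample file are all assumptions made for illustration:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
import unicodedata as ud

# Hypothetical replacement table; keys are intended to be in NFC form.
REPLACEMENTS = [
    (u'\u0283', u' sh '),        # ʃ
    (u'\u0250\u0342', u' an '),  # ɐ + COMBINING GREEK PERISPOMENI
    (u'\u1ebd', u' en '),        # ẽ as a single codepoint
]

def transliterate(line):
    """Normalize to NFC, then apply each replacement in order."""
    line = ud.normalize('NFC', line)
    for old, new in REPLACEMENTS:
        line = line.replace(old, new)
    return line

# Write a small UTF-8 sample file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with io.open(path, 'w', encoding='utf8') as f:
    f.write(u'\u0283e\u0303\n')  # ʃ followed by decomposed e + tilde

# Read it back as Unicode and process line by line.
with io.open(path, 'r', encoding='utf8') as f:
    results = [transliterate(line.rstrip(u'\n')) for line in f]

print(results)
```

Normalizing the input before replacing means the decomposed ẽ in the file still matches the single-codepoint key in the table.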
Mark Tolonen
  • this works but still leaves some characters out, for example the combined characters like ẽ. I would be interested in a complete solution but this is almost there. Thanks a bunch! – badner Nov 07 '15 at 19:45
  • @badner, if you have other requirements, they should be edited into your question. This satisfies the requirements as stated in the question. It sounds like you may need to normalize your Unicode string first. Provide explicit examples. – Mark Tolonen Nov 07 '15 at 19:46
  • that is fair enough. You did answer the question. Let me edit it then. – badner Nov 07 '15 at 19:49
  • @badner, I've edited the answer to show that your new examples can be replaced, but they do have some variation in whether the characters can be represented in one or two codepoints. Providing explicit examples of how it is failing for you with the exact error messages would help. – Mark Tolonen Nov 07 '15 at 20:40