Python - codec encoding ascii to unicode: error

Question

:) I am trying to reverse the transliteration of an input file (currently in English) back to its original form (in Hindi).

A sample or a part of the input file looks like this:

E-k- b-u-d-z*dhi-m-aan- p-ksii#

E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
U-s- k-ii p-t-z*t-o-ng s-e- l-d-ii shaakhaay-e-ng m-j-*zb-uut- b-aaj-u-O-ng k-ii t-r-h- pheil-ii h-u-II thiing#
w-n- h-NNs-o-ng k-aa E-k- jhu-nhz*D- I-s- p-e-dr p-r- n-i-w-aas- k-r-t-aa thaa#
w-e- s-b- y-h-aaNN s-u-r-ksi-t- the- AUr- b-dre- AAr-aam- s-e- r-h-t-e- the-#
U-n- m-e-ng s-e- E-k- p-ksii b-h-u-t- b-u-d-z*dhi-m-aan- thaa#
I-s- b-u-d-z*dhi-m-aan- p-ksii n-e- E-k- d-i-n- p-e-dr k-ii j-dr m-e-ng s-e- E-k- l-t-aa k-o- U-g-t-e- d-e-khaa# 
I-s- k-e- b-aar-e- m-e-ng U-s-n-e- d-uus-r-e- p-ksi-y-o-ng s-e- b-aat- k-ii#
"k-z*y-aa t-u-m-z*h-e-ng w-h- l-t-aa d-i-khaaII d-e-t-ii h-ei", U-s- n-e- U-n- s-e- p-uuchaa "t-u-m-z*h-e-ng I-s-e- n-Shz*T- k-r- d-e-n-aa c-aah-i-E-"#
"I-s-e- k-z*y-o-ng n-Shz*T- k-r- d-e-n-aa c-aah-i-E-?" h-NNs-o-ng n-e- AAshz*c-*ry- s-e- p-uuchaa "y-h- t-o- I-t-n-ii cho-T-ii s-e- h-ei#
h-m-e-ng y-h- k-z*y-aa h-aan-i- p-h-u-NNc-aa s-k-t-ii h-ei"#
"m-e-r-e- m-i-tro-ng," b-u-d-z*dhi-m-aan- p-ksii n-e- U-t-z*t-r- d-i-y-aa "w-h- cho-T-ii s-ii l-t-aa j-l-z*d-ii h-ii b-drii h-o- j-aay-e-g-ii#
y-h- h-m-aar-e- p-e-dr p-r- c-Dh*z k-r- U-s- s-e- l-i-p-T-t-ii j-aay-e-g-ii AUr- phi-r- m-o-T-ii AUr- m-j-*zb-uut- h-o- j-aay-e-g-ii"#
"t-o- k-z*y-aa h-u-AA"#

Its equivalent meaning in English is:

A WISE OLD BIRD.

Deep in the forest stood a very tall tree.
Its leafy branches spread out like long arms.
This was the home of a flock of wild geese.
They were safe there.
One of the geese was a wise old bird.
One day this wise old bird noticed a small creeper growing at the foot of the tree.
He spoke to the other birds about it.
"Do you see that creeper ?" he said to them.
"You must destroy it."
"Why must we destroy it ?" asked the geese in surprise.
"It is so small.
What harm can it do?"
"My friends," replied the wise old bird, " that little creeper will soon grow.

My script looks like this:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
CODEC = 'utf-8'
input_file=sys.argv[1]
output_file=sys.argv[2]
list1=[]



f=open(input_file,'r')
f1 = open(output_file,'w')

english_hindi_dict={'A' : u'अ' ,  'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
                'UU' : u'ऊ' , 'r' : u'ऋ' , 'E' : u'ए' , 'ai' : u'ऐ' , 'O' : u'ओ' , 'AU' : u'औ' ,\
                'k' : u'क' , 'kh' : u'ख' , 'g' : u'ग' , 'gh' : u'घ' , 'c' : u'च' , 'ch' : u'छ',\
                'j': u'ज' , 'jh' : u'झ' , 'tr' : u'त्र' , 'T' : u'ट'  , 'Th' : u'ठ' , 'D' : u'ड',\
                'dr' : u'ड' , 'Dh' : u'ढ' , 'Na' : u'ण' , 'th' : u'त' ,  'tha' : u'थ',\
                'd' : u'द' , 'dh': u'ध' , 'n' : u'न' , 'p' : u'प' , 'ph' : u'फ' ,\
                'b' : u'ब' , 'bh' : u'भ' , 'm' : u'म' , 'y' : u'य' , 'r' : u'र' , 'l' : u'ल' ,\
                'w' : u'व' , 'sh' : u'श' , 'sha' : u'ष', 's' : u'स' , 'h' : u'ह' , 'ks' : u'क्ष' ,\
                'i' : u'ि' , 'ii' : u'ी' , 'u' : u'ु' , 'uu' : u'ू' , 'e' : u'े' ,\
                'aa' : u'ै' , 'o' : u'ो' , 'AU' : u'ौ' ,'H' : u'्' ,'mn' : u'ं' ,\
                'NN' : u'ँ' , 'AW' : u'ॅ' , 'rr' : u'ृ' , '4' : u'४' , '6': u'६'  , '8' : u'८',\
                '2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}
for line in f:
      #line=line.strip() to remove a line from its newline character....  
      #line=line.rstrip('.')   
      line=line.replace('-','')
      line=line.replace('#','|') # i am using the or symbol for poornviram
      #line=line.replace('।','')
      #line = line.lower()
for word in line:
    for ch in word:
        if (ch in english_hindi_dict) :
            translatedToken = english_hindi_dict[ch]
        else :
                translatedToken = ch

#{ translatedToken = english_hindi_dict[ch] }

#for ch in line:
    f1.write(translatedToken)
    #print translatedToken
    #line = line.replace( char,english_hindi_dict[char] )   
      #list1.append(line)
f.close()

f1.write(' '.join(list1))

f1.close()

The error that I am getting is:

python transliterate_eh_nw.py Hstory.txt op1.txt
Traceback (most recent call last):
  File "transliterate_eh_nw.py", line 43, in <module>
    f1.write(translatedToken)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u092f' in position 0: ordinal not in range(128)

Could you please tell me how to deal with this error? Thank you. :)

Working with only unicode is the only thing to do this right, see other answers. I think you need some kind of ordering, for example you want to be sure you do the 'AA' substitution before you do 'A'. — u0b34a0f6ae, Feb 15 '10 at 10:40
Accepted Sir.. sorry for the delay.. hope you don't misunderstand Thank you..:) — boddhisattva, Mar 05 '10 at 15:28
Thanks for the acceptance. I understand that you have been busy. I am interested in what the transliteration scheme actually is, and I'm curious about answers to the questions that I raised about possible anomalies in your transliteration dictionary, and can I be of any further help. If you're not busy, perhaps you'd like to communicate by private e-mail ... replace_spaces_by_appropriate_punctuation("sjmachin lexicon net") — John Machin, Mar 06 '10 at 00:13
@JohnMachin I'm sorry, I somehow missed your earlier message. I've dropped you a note at the email id that you've shared to discuss this further. — boddhisattva, Dec 25 '16 at 09:43

John Machin · Accepted Answer · 2010-02-21T12:47:14.953

You have a few problems other than the one which you asked about.

(1) A conceptual problem: "E-k- b-u-d-z*dhi-m-aan- p-ksii#" is not "english". It is Hindi language written in ASCII using some romanization scheme. It looks like ITRAN but ITRAN doesn't have AA and A, it has only aa and a. Does the scheme have a name? Can you supply a URL? Your object is better described as "transliterate some Hindi text from the unnamed romanization to Devanagari script".

(2) Showing the result of translating your text from Hindi to English ("A WISE OLD BIRD" etc) is only moderately useful. The expected Devanagari output would be a better idea.

(3) As remarked by @kaiser.se, the transliteration dictionary has multi-byte (up to 3 bytes!) keys, some of which are prefixes of others. Presumably AA must be recognised in priority to A, gh must be recognised before g, etc. Iterating over the items of a dictionary happens in an order that is predictable but for your purposes should be regarded as random. In the code that follows, I've given priority to longer "keys".

(4) Either the dictionary is missing some letter keys (a S t z) or the transliteration rules are more complicated than any of us has guessed so far.

(5) The meaning of the characters # * and - is not 100% obvious. It appears from your input text that z and * appear only in combination, as z* (or occasionally *z).

(6) It would be a good idea if you explained the interpretation of e.g. shaakhaay-e-ng ... does it start with sh then aa or does it start with sha then a? What are the rules?
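The "longer keys first" idea from point (3) can also be sketched as a single left-to-right pass with a regular expression, instead of repeated str.replace calls. This is only an illustration: SAMPLE_TABLE is a hypothetical five-entry excerpt (with the stray trailing spaces removed), and build_pattern and transliterate are names I made up. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

# Hypothetical excerpt for illustration; the real table is the asker's dictionary.
SAMPLE_TABLE = {
    'A': 'अ',
    'AA': 'आ',
    'g': 'ग',
    'gh': 'घ',
    'k': 'क',
}

def build_pattern(table):
    """Compile an alternation that tries longer keys before their prefixes."""
    keys = sorted(table, key=len, reverse=True)
    return re.compile('|'.join(re.escape(k) for k in keys))

def transliterate(text, table):
    """Replace every table key in one pass; text that matches no key is kept."""
    pattern = build_pattern(table)
    return pattern.sub(lambda m: table[m.group(0)], text)
```

With this excerpt, transliterate('gh-AA', SAMPLE_TABLE) gives 'घ-आ': gh wins over g and AA wins over A, and the hyphen is left for a later step.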

The answer to the problem that you asked about is, as several others have pointed out, that you need to encode your unicode output before writing it, using an encoding that can represent Devanagari and that your display device supports, e.g. UTF-8.

Here's some code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

input_data = """
E-k- b-u-d-z*dhi-m-aan- p-ksii#

E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
[snip]
"t-o- k-z*y-aa h-u-AA"#
"""

roman_devanagari_dict={'A' : u'अ' ,  'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
[snip]
            '2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}

#Presuming we need to do the 3-letter cases then the 2-letter then the 1-letter
replacements = [(-len(k), unicode(k), v) for k, v in roman_devanagari_dict.items()]
replacements.sort()

data = input_data.decode('ascii')

for _junk, from_text, to_text in replacements:
    data = data.replace(from_text, to_text)

# Presuming the '-' are inter-character markers, delete them last, not first
data = data.replace(u'-', '')
data = data.replace(u'#', '')
print "untransliterated:", set(c for c in data if 0x20 < ord(c) < 0x7f)

BOM = u'\ufeff'
outf = open('devanagari.txt', 'w')
outf.write(BOM.encode('utf8')) # for the benefit of clueless Windows s/w
outf.write(data.encode('utf8'))
outf.close()

Output:

एक बुदz*धिमैन पक्षी

एक घने जनगगल मेनग एक बहुt ऊँचै पेड थa उ स की पtztोनग से लदी षaखैयेनग मजzबूt बैजुओनग की tरह फेिली हुई तीनग वन हँसोनग कै एक झुनहzड इस पेड पर निवैस करtै थa वे सब यहैँ सुरक्षिt ते ौर बडे आ रैम से रहtे ते उ न मेनग से एक पक्षी बहुt बुदzधिमैन थa इस बुदzधिमैन पक्षी ने एक दिन पेड की जड मेनग से एक लtै को उ गtे देखै इस के बैरे मेनग उ सने दूसरे पक्षियोनग से बैt की "कzयै tुमzहेनग वह लtै दिखैई देtी हेि", उ स ने उ न से पूछै "tुमzहेनग इसे नSहzट कर देनै चैहिए" "इसे कzयोनग नSहzट कर देनै चैहिए?" हँसोनग ने आ शzचरय से पूछै "यह tो इtनी छोटी से हेि हमेनग यह कzयै हैनि पहुँचै सकtी हेि" "मेरे मित्रोनग," बुदzधिमैन पक्षी ने उ tztर दियै "वह छोटी सी लtै जलzदी ही बडी हो जैयेगी यह हमैरे पेड पर चढz कर उ स से लिपटtी जैयेगी ौर फिर मोटी ौर मजzबूt हो जैयेगी" "tो कzयै हुआ "

which has only a few recognisable words when shoved through Google Translate.

Update after examining the transliteration table more closely:

Three of the entries (AA, II, and U) have a space after the Devanagari equivalent. Perhaps the spaces should be removed.
The general pattern for consonants appears to be:

DEVANAGARI LETTER XA is represented by x
DEVANAGARI LETTER XXA is represented by X
DEVANAGARI LETTER XHA is represented by xh
DEVANAGARI LETTER XXHA is represented by Xh

However 3 entries break the pattern:
SSA -> sha but pattern says S
TA -> th but pattern says t
THA -> tha but pattern says th

Note: changing the above 3 entries stopped my code from complaining that S and t were left unchanged when transliterating your sample text, and removed the seemingly-anomalous sha and tha entries.

Two entries (D and dr) are mapped to the same character, DEVANAGARI LETTER DDA. D is the expected entry for that character; perhaps dr should be mapped elsewhere.
There is no entry for DEVANAGARI LETTER NGA (U+0919); perhaps it should be encoded as ng -- there are a few words ending in ng in the sample text.
Are the uncatered-for "z*" occurrences in the sample text anything to do with DEVANAGARI LETTER ZA (U+095B)?
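Checks like the ones above can be automated. The sketch below is hypothetical (the audit function and the PAIRS excerpt are mine, not part of the original script) and keeps the table as a list of pairs, because a dict literal silently keeps only the last value for a repeated key. In the question's table, r and AU each appear twice.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from collections import defaultdict

# A few pairs copied from the question's table, kept as a list so that
# repeated keys stay visible (a dict literal would keep only the last one).
PAIRS = [
    ('AA', 'आ '),  # trailing space, as in the question
    ('r', 'ऋ'),
    ('D', 'ड'),
    ('dr', 'ड'),
    ('r', 'र'),
]

def audit(pairs):
    """Return (repeated keys, keys whose value has stray whitespace,
    values that are shared by more than one key)."""
    keys_seen = defaultdict(list)
    values_seen = defaultdict(list)
    for key, value in pairs:
        keys_seen[key].append(value)
        values_seen[value.strip()].append(key)
    repeated_keys = sorted(k for k, vs in keys_seen.items() if len(vs) > 1)
    padded = sorted(k for k, v in pairs if v != v.strip())
    shared = sorted(v for v, ks in values_seen.items() if len(ks) > 1)
    return repeated_keys, padded, shared
```

For this excerpt, audit(PAIRS) reports r as a repeated key, AA as having a padded value, and ड as a value reached from two keys.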

Hi John...:) First of all many thanks for your valuable time and help, I am sorry about not yet accepting any of the answers till now, I have been a bit held up recently with related work will surely get back to you asap, hope you understand.... — boddhisattva, Feb 21 '10 at 23:46
@john wonder if you can help with this http://stackoverflow.com/questions/41079364/python-get-unicode-from-devnagari-character — user2661518, Dec 10 '16 at 19:50
@user2661518 Do you mean the question that you have just deleted? — John Machin, Dec 10 '16 at 20:23
@JohnMachin sorry I kind of found answer and undeleted and here is main question I'm getting at http://stackoverflow.com/questions/41079958/join-devanagari-words-incorrectly-extracted-from-pdfminer — user2661518, Dec 10 '16 at 20:27

score 1 · Answer 2 · answered Feb 15 '10 at 10:43

f1.write(' '.join(list1))

list1, at this point, contains Unicode strings. You can't write Unicode directly to a file, because a file is a byte interface. You should either encode it explicitly (' '.join(list1).encode('utf-8')), or, as Ignacio suggests, use a codecs wrapper to implicitly encode the Unicode strings you send to it. At the moment you are defining a variable CODEC, but not doing anything with it.
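A minimal sketch of both options, assuming Python 2.6+ or Python 3. Note that it uses io.open rather than codecs.open for the wrapper variant; both accept an encoding. The output path is a throwaway temporary file chosen for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import io
import os
import tempfile

text = '\u092f'  # the character named in the traceback

# Encoding to ASCII fails; this is what the implicit conversion attempted.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Hypothetical output location, used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), 'op1.txt')

# Option 1: open the file in binary mode and encode explicitly.
with open(out_path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Option 2: let the file object do the encoding.
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the file back, decoding with the same encoding.
with io.open(out_path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

Either option avoids the UnicodeEncodeError, because the text is never pushed through the ASCII codec.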

score 1 · Answer 3 · answered Feb 15 '10 at 19:09

Are you sure you want to remove all the hyphens (-)? Looking at your input file, it looks like all replacements are two- or three-character codes, such as u'I-':u'इ'. If this is so, you could do something like the code below, but make sure you're using Unicode strings for all the keys and values in the dictionary:

import codecs

# read the whole file at once
f = codecs.open(input_file,'r','ascii')
data = f.read()
f.close()

# perform all the replacements
for k,v in english_hindi_dict.items():
    data = data.replace(k,v)

# write the whole file result
f = codecs.open(output_file,'w',CODEC)
f.write(data)
f.close()

Following that theory, I got the following result, which looks like translations such as 'z*', 't-', 'ng', and 'ei' are missing from the dictionary. I don't read Hindi, but Google Translate came up with some of the English words in your translation, so I think I'm on the right track.

-z*धिमैन पक्षी

एक घने जngगल मेng एक बहुt- ऊँचै पेड तै
उस की पt-z*t-ोng से लदी शैखैयेng मज*zबूt- बैजुओng की t-रह फeiली हुई तीng
वन हँसोng कै एक झुnhz*ड इस पेड पर निवैस करt-ै तै
वे सब यहैँ सुरक्षिt- ते ौर बडे आरैम से रहt-े ते
उन मेng से एक पक्षी बहुt- बुदz*धिमैन तै
इस बुदz*धिमैन पक्षी ने एक दिन पेड की जड मेng से एक लt-ै को उगt-े देखै 
इस के बैरे मेng उसने दूसरे पक्षियोng से बैt- की
"कz*यै t-ुमz*हेng वह लt-ै दिखैई देt-ी हei", उस ने उन से पूछै "t-ुमz*हेng इसे नShz*ट कर देनै चैहिए"
"इसे कz*योng नShz*ट कर देनै चैहिए?" हँसोng ने आशz*च*rय से पूछै "यह t-ो इt-नी छोटी से हei
हमेng यह कz*यै हैनि पहुँचै सकt-ी हei"
"मेरे मित्रोng," बुदz*धिमैन पक्षी ने उt-z*t-र दियै "वह छोटी सी लt-ै जलz*दी ही बडी हो जैयेगी
यह हमैरे पेड पर चढ*z कर उस से लिपटt-ी जैयेगी ौर फिर मोटी ौर मज*zबूt- हो जैयेगी"
"t-ो कz*यै हुआ"
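To list the leftover codes in output like the above without reading it by eye, something along these lines might help. It is a rough sketch: leftover_codes is a hypothetical helper, and it assumes that any run of ASCII letters, '*' or '-' still present after replacement is a code the dictionary did not cover.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re
from collections import Counter

def leftover_codes(text):
    """Count runs of ASCII letters, '*' and '-' that survived transliteration."""
    return Counter(re.findall(r'[A-Za-z*-]+', text))
```

Calling leftover_codes on the converted text and sorting the result by count shows which missing codes matter most.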

Thank you for your answer Sir.. I will get back to you Sir in a some time.. am I kinda bit held up... — boddhisattva, Feb 21 '10 at 23:47
