
I'm trying to figure out a way to deal with special characters that cannot be found in the standard ASCII chart. I'm doing some translation poetry to become familiar with the httplib and urllib modules. The problem arises when translating from one language to another with a different alphabet: some phrases from English to Spanish or French to English work, but only if I choose my words carefully ahead of time to avoid any conflict, which defeats the purpose. Pardon the strange sentence I pass; I don't exactly have a way with charming words.

import httplib, urllib, json
connObj = httplib.HTTPConnection("api.mymemory.translated.net")
def simpleTrans(conn, text, ln1, ln2):
    paramDict = {'q': text,
                 'langpair':ln1+"|"+ln2}
    params = urllib.urlencode(paramDict)
    conn.request("GET","/get?"+params)
    res = conn.getresponse()
    serializedText = res.read()
    responseDict = json.loads(serializedText)
    return responseDict['responseData']['translatedText']


a = simpleTrans(connObj, "man eats dogs for the sake of poetry police give him ten years in jail", 'en', 'fr')
b = simpleTrans(connObj, a, 'fr', 'es')
c = simpleTrans(connObj, b, 'es', 'no')
print (simpleTrans(connObj, c, 'no', 'en'))

This yields the following error, as expected:

bash-3.2$ python translationPoetry.py 
Traceback (most recent call last):
  File "translationPoetry.py", line 15, in <module>
    b = simpleTrans(connObj, a, 'fr', 'es')
  File "translationPoetry.py", line 6, in simpleTrans
    params = urllib.urlencode(paramDict)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1294, in urlencode
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 54: ordinal not in range(128)

If anyone could bounce some ideas off me, I'd be very grateful!

Monte Carlo
  • Change `return responseDict['responseData']['translatedText']` to `return responseDict['responseData']['translatedText'].encode('utf-8')` and see if that helps. – Blender Apr 30 '13 at 02:54
  • Worked like a charm, going to do more research into this. Thank you so much. – Monte Carlo Apr 30 '13 at 03:06
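
The fix suggested in the comments can be sketched as follows. In Python 2, `json.loads` returns `unicode` objects and `urllib.urlencode` calls `str()` on each value, which falls back to the ASCII codec. Encoding the text to UTF-8 bytes first avoids that. This is only a sketch: `build_query` is a hypothetical helper name, it encodes just before `urlencode` rather than at the `return` as the comment does, and the import fallback is there so the snippet also runs on Python 3.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

def build_query(text, ln1, ln2):
    """Build the query string for the /get endpoint (hypothetical helper)."""
    # Encode text to UTF-8 bytes so urlencode never has to guess a codec.
    if not isinstance(text, bytes):
        text = text.encode('utf-8')
    param_dict = {'q': text, 'langpair': ln1 + '|' + ln2}
    return urlencode(param_dict)

# The e-acute that triggered the error is now percent-encoded as UTF-8.
query = build_query(u'caf\xe9', 'fr', 'es')
```

The result can then be passed on as `conn.request("GET", "/get?" + query)`, exactly as in the question.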

1 Answer


ASCII is a limited character set: each character is represented in 7 bits, so it covers only 128 characters. I suggest you have a look at Unicode. Unicode is a standard that can represent far more than the English alphabet, including the accented characters used in French and Spanish.

You can start here.

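Also have a look at the string methods decode() and encode(). In Python 2, decode() turns a byte string into a unicode object, and encode() does the reverse, which is what urlencode needs here.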

st = 'ASCII character string.'
st.decode('utf-8')
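
As a rough illustration of the difference between the two directions, here is a small round trip. The sample string is made up for the example, and the snippet is written so that it behaves the same on Python 2 and Python 3.

```python
text = u'caf\xe9'              # Unicode text with one non-ASCII character

raw = text.encode('utf-8')     # text -> bytes; the e-acute becomes two bytes
back = raw.decode('utf-8')     # bytes -> text; recovers the original string

# Encoding to ASCII fails, which is the same error as in the traceback.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)
```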
csurfer