0

It seems I've run into a problem with the encoding itself, at the point where I need to pass the Bing translation results through a socket.

def _unicode_urlencode(params):
    if isinstance(params, dict):
        params = params.items()
    return urllib.urlencode([(k, isinstance(v, unicode) and v.encode('utf-8') or v) for k, v in params])

def _run_query(args):
    data = _unicode_urlencode(args)
    sock = urllib.urlopen(api_url + '?' + data)
    result = sock.read()
    if result.startswith(codecs.BOM_UTF8):
        result = result.lstrip(codecs.BOM_UTF8).decode('utf-8')
    elif result.startswith(codecs.BOM_UTF16_LE):
        result = result.lstrip(codecs.BOM_UTF16_LE).decode('utf-16-le')
    elif result.startswith(codecs.BOM_UTF16_BE):
        result = result.lstrip(codecs.BOM_UTF16_BE).decode('utf-16-be')
    return json.loads(result)

def set_app_id(new_app_id):
    global app_id
    app_id = new_app_id

def translate(text, source, target, html=False):
    """
    action=opensearch
    """
    if not app_id:
        raise ValueError("AppId needs to be set by set_app_id")
    query_args = {
        'appId': app_id,
        'text': text,
        'from': source,
        'to': target,
        'contentType': 'text/plain' if not html else 'text/html',
        'category': 'general'
    }
    return _run_query(query_args)
...
text = translate(sys.argv[2], 'en', 'tr')
HOST = '127.0.0.1'
PORT = 894
s = socket.socket()
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.connect((HOST, PORT))
s.send("Bing translation: " + text.encode('utf8') + "\r");
s.close()

As you can see, if the translated text contains Turkish characters, the script fails to send the text to the socket.

Do you have any idea how to fix this?

Regards.

jamall55

2 Answers

2

Your problem is entirely unrelated to the socket. text is already a bytestring, and you're trying to encode it. What happens is that Python first tries to convert the bytestring to a unicode object via the safe ASCII codec so that it can then encode it as UTF-8, and that implicit decode fails because the bytestring contains non-ASCII characters.
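To make the failure concrete, here is a minimal sketch that spells out the two steps Python 2 performs implicitly when .encode('utf8') is called on a bytestring. The sample word and the helper name encode_like_python2 are assumptions for illustration and are not part of the original script.

```python
# -*- coding: utf-8 -*-
# Sketch: what Python 2 effectively does for bytestring.encode('utf-8').
# The sample text and the helper name are hypothetical.

def encode_like_python2(bytestring):
    # Step 1 (implicit in Python 2): decode with the default ASCII codec.
    as_unicode = bytestring.decode('ascii')
    # Step 2: the encode that was actually requested.
    return as_unicode.encode('utf-8')

# A word containing U+015F, already encoded as UTF-8 (bytes 0xC5 0x9F).
already_encoded = u"ta\u015f".encode('utf-8')

try:
    encode_like_python2(already_encoded)
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    # Step 1 fails as soon as it meets a byte outside the ASCII range.
    failed = True
    message = str(exc)
```

Pure-ASCII bytestrings pass through both steps unchanged, which is why the script only breaks once the translation contains Turkish characters.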

You should fix translate to return a unicode object, by using a JSON library that returns unicode objects.
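One way to get unicode out of the response is to decode the raw bytes before handing them to the JSON parser. The sketch below assumes the standard-library json module (Python 2.6 or later) and assumes that a response without a BOM is UTF-8; decode_response and parse_translation are hypothetical names, not part of the original script.

```python
import codecs
import json

def decode_response(raw):
    # Check each known BOM and slice it off by length. Slicing removes
    # exactly the prefix, whereas lstrip() removes a set of characters.
    for bom, encoding in (
        (codecs.BOM_UTF8, 'utf-8'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ):
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    # Assumption: a response without a BOM is UTF-8.
    return raw.decode('utf-8')

def parse_translation(raw):
    # The parser receives unicode text, so the strings it returns
    # are unicode objects as well.
    return json.loads(decode_response(raw))
```

With this in place, the caller always receives unicode and encodes it exactly once, right before writing to the socket.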

Alternatively, if text is already a bytestring encoded as UTF-8, you can simply use

s.send("Bing translation: " + text + "\r")
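If it is unclear whether text arrives as unicode or as an already-encoded bytestring, a small guard can make sure the encode happens at most once. This is a sketch only; to_utf8_bytes and build_message are hypothetical helpers, and the guard assumes that any bytestring passed in is already UTF-8.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3

def to_utf8_bytes(value):
    # Encode only real text objects; bytestrings are assumed to be
    # UTF-8 already and are returned untouched.
    if isinstance(value, TEXT_TYPE):
        return value.encode('utf-8')
    return value

def build_message(text):
    # Every piece is a bytestring, so no implicit ASCII decode happens.
    return b"Bing translation: " + to_utf8_bytes(text) + b"\r"
```

The socket call from the question would then read s.send(build_message(text)).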
phihag
  • I added the translate code to the OP. I am not sure how to fix that. Can you explain it a bit more, since I am a newbie? Thanks – jamall55 Jul 04 '13 at 23:59
  • @jamall55 The code you posted shows that most likely the JSON library is at fault. Since it is not in the standard library in 2.5 (you should really use a newer Python version, but I digress), which `json` library are you using here? And what don't you understand in this answer, i.e. what should I elaborate on? – phihag Jul 05 '13 at 00:20
  • I was able to get it done. The string was being encoded twice, and the second, wrong encode was the one that failed. Thanks. – jamall55 Jul 05 '13 at 00:37
-1
# -*- coding:utf-8 -*-

 text = u"text in your language"
 s.send(u"Bing translation: " + text.encode('utf8') + u"\r");

This should work. The text must be encoded as UTF-8.

Vasiliy Stavenko
  • It didn't work. s.send(u"CNN Bing translation: " + text.encode('utf8') + u"\r"); UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128) – jamall55 Jul 04 '13 at 23:42
  • What is your source file encoding? – Vasiliy Stavenko Jul 04 '13 at 23:45
  • -1 Apart from broken indentation in this answer, there is absolutely no reason why you would ever want to send unicode over a socket. – phihag Jul 04 '13 at 23:48
  • It's UTF-8. All I want to do is send the translation result that I got from Bing through the socket. – jamall55 Jul 04 '13 at 23:51
  • It can't be. `u'your language string'` is equal to `unicode('your language string', 'encoding of your source file')`. You may then want to convert it to some encoding with `ustr.encode('utf-8')` – Vasiliy Stavenko Jul 04 '13 at 23:57