Python Requests URL with Unicode Parameters

Question

I'm currently trying to hit the Google TTS URL, http://translate.google.com/translate_tts, with Japanese characters and phrases in Python using the requests library.

Here is an example:

http://translate.google.com/translate_tts?tl=ja&q=ひとつ

However, when I try to use the Python requests library to download the MP3 that the endpoint returns, the resulting MP3 is blank. I have verified that I can hit this URL in requests using non-Unicode characters (via romaji) and get correct responses back.

Here is part of the code I am using to make the request:

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):

    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url

Also, if I print text or url within this snippet, the kana/kanji is rendered correctly in my console.

Edit:

If I encode the Unicode text as UTF-8 and quote it as follows, I still get the same response.

# -*- coding: utf-8 -*-

from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):

    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    text = urllib.quote(text.encode('utf-8'))
    url = 'http://translate.google.com/translate_tts?tl=%(glang)s&q=%(text)s' % locals()
    print url
    if download:
        result = requests.get(url)
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url

Which returns this:

http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4

Which seems like it should work, but doesn't.
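
For reference, the percent-encoded form above is just the UTF-8 bytes of the text, escaped byte by byte. Here is a minimal sketch in Python 3 (the snippets in this question are Python 2, where the call is urllib.quote); the helper name percent_encode_utf8 is made up for illustration.

```python
from urllib.parse import quote, unquote


def percent_encode_utf8(text):
    """Percent-encode text as UTF-8, as a query-string value would be."""
    # quote() encodes str input as UTF-8 by default; safe='' also escapes '/'.
    return quote(text, safe='')


encoded = percent_encode_utf8('ひとつ')
print(encoded)           # the q= value from the URL above
print(unquote(encoded))  # round-trips back to the original text
```

The URL itself is therefore well-formed UTF-8, which suggests that the client-side encoding step is not what is going wrong.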

Edit 2:

If I attempt to use urllib/urllib2, I get a 403 error.

Edit 3:

So, it seems that this behavior is limited to this endpoint. If I try the following URL, which is a different endpoint:

http://www.kanjidamage.com/kanji/13-un-%E4%B8%8D

From within requests and my browser, I get the same response (they match). Even if I send ASCII characters to the server, as in this URL:

http://translate.google.com/translate_tts?tl=ja&q=sayonara

I get the same response as well (they match again). But if I send Unicode characters to this URL, I get a correct audio file in my browser, but not from requests, which returns an audio file with no sound.

http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4

So, it seems like this behavior is limited to the Google TTS URL?

Your console probably isn't configured correctly to display the characters, and this sounds like a general UTF-8 encoding problem. Can I ask what OS you're on, even though I'm pretty sure you're on a Windows machine? – Austin A, Jan 15 '15 at 02:46
@AustinA It's displaying fine on my console. It's a console within PyCharm and I'm currently running this within a Linux environment. – jab, Jan 15 '15 at 02:52
Hey, sorry @jab, I misread your "is rendered" as "isn't rendered". Regardless, I hope the code snippet I added works. I'm not sure if you attempted UTF-8 encoding and decoding beforehand. – Austin A, Jan 15 '15 at 02:57
Another note: if you attempt to use なな, it will return an "e" sound, like the character え. Everything else I've tried returns nothing. – jab, Jan 15 '15 at 03:16

score 2 · Accepted Answer · answered Jan 16 '15 at 08:42

The user agent can be part of the problem; however, it is not the cause in this case. The translate_tts service rejects (with HTTP 403) some user agents, e.g. any that begin with Python, curl, wget, and possibly others. That is why you see an HTTP 403 response when using urllib2.urlopen(): it sets the user agent to Python-urllib/2.7 (the version may vary).
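
A rough way to see the default user agent and to override it, sketched in Python 3 with the standard library (where urllib2 became urllib.request). Nothing is sent over the network here. build_tts_request is a hypothetical helper name, and whether the service still filters user agents this way is an assumption.

```python
import urllib.request

TTS_URL = 'http://translate.google.com/translate_tts'


def default_user_agent():
    """Return the User-Agent that urllib sends when none is set."""
    # A fresh opener carries one default header: ('User-agent', 'Python-urllib/X.Y').
    return dict(urllib.request.build_opener().addheaders)['User-agent']


def build_tts_request(user_agent='Mozilla/5.0'):
    """Build (but do not send) a request with an explicit User-Agent."""
    return urllib.request.Request(TTS_URL, headers={'User-Agent': user_agent})


print(default_user_agent())
req = build_tts_request()
# Request stores header names in capitalized form, hence 'User-agent'.
print(req.get_header('User-agent'))
```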

You found that setting the user agent to Mozilla/5.0 fixed the problem, but that might work only because the API might assume a particular encoding based on the user agent.

What you should actually do is explicitly specify the URL character encoding with the ie field. Your URL request should look like this:

http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4

Note the ie=UTF-8, which explicitly sets the URL character encoding. The spec states that UTF-8 is the default, but that doesn't seem to be entirely true, so you should always set ie in your requests.
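
As a sketch of how that query string is assembled, the following Python 3 snippet builds the URL with the standard library only; as far as I know, requests applies the equivalent UTF-8 encoding when given params. build_tts_url is a hypothetical helper name, and the parameter set is just the three fields shown above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

TTS_URL = 'http://translate.google.com/translate_tts'


def build_tts_url(text, lang='ja'):
    """Return the TTS URL with the input encoding stated explicitly."""
    # A list of pairs keeps the parameter order; urlencode() percent-encodes
    # str values as UTF-8 by default.
    query = urlencode([('ie', 'UTF-8'), ('tl', lang), ('q', text)])
    return '{}?{}'.format(TTS_URL, query)


url = build_tts_url('ひとつ')
print(url)
# Decoding the query again recovers the original text.
print(parse_qs(urlsplit(url).query)['q'][0])
```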

The API supports kanji, hiragana, and katakana (possibly others?). The kanji, hiragana, and katakana inputs below all produce "nihongo", although the audio produced for hiragana input has a slightly different inflection from the others.

import requests

one = u'\u3072\u3068\u3064'
kanji = u'\u65e5\u672c\u8a9e'
hiragana = u'\u306b\u307b\u3093\u3054'
katakana = u'\u30cb\u30db\u30f3\u30b4'
url = 'http://translate.google.com/translate_tts'

for text in one, kanji, hiragana, katakana:
    r = requests.get(url, params={'ie': 'UTF-8', 'tl': 'ja', 'q': text})
    print u"{} -> {}".format(text, r.url)
    # Write the MP3 bytes and close the file promptly.
    with open(u'/tmp/{}.mp3'.format(text), 'wb') as f:
        f.write(r.content)

score 0 · Answer 2 · edited May 23 '17 at 10:32

I wrote this little function a while ago to help me with UTF-8 encoding. I was having issues printing Cyrillic and CJK text to CSVs, and this did the trick.

def assist(unicode_string):
    utf8 = unicode_string.encode('utf-8')
    read = utf8.decode('string_escape')

    return read   ## UTF-8 encoded string

Also, make sure you have these two lines at the beginning of your .py file.

#!/usr/bin/python
# -*- coding: utf-8 -*-

The first line is just a good Python habit: it specifies which interpreter to use to run the .py file (really only useful if you have more than one version of Python installed on your machine). The second line specifies the encoding of the Python source file. A slightly longer answer for this is given here.

answered Jan 15 '15 at 02:55 by Austin A · edited May 23 '17 at 10:32 by Community

I tried encoding the unicode to utf-8, but I still get the same behavior. :/ – jab Jan 15 '15 at 03:07
Have you tried writing the output to a text file? I would be interested in hearing the result from that. – Austin A Jan 15 '15 at 04:02
When I use the requests library with a UTF-8-encoded URL, I get a blank MP3 file (but the MP3 details are there). If I pass this same URL using urllib2/httplib2, I get a 403 error. – jab Jan 15 '15 at 15:25

score 0 · Answer 3 · answered Jan 15 '15 at 16:13

Setting the User-Agent to Mozilla/5.0 fixes this issue.

from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):

    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text}, headers={'User-Agent': 'Mozilla/5.0'})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
