UTF-8 mismatch in script

Question

I have an issue with a Python script. I am trying to translate some sentences with the Google Translate API. Some sentences come back with broken non-ASCII characters such as ä, ö or ü. I can't work out why some sentences work and others don't.

If I make the API call directly in the browser, it works, but inside my Python script I get a mismatch.

This is a cut-down version of my script that reproduces the error:

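# -*- coding: utf-8 -*-
import requests
import json

satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
# The sentence is appended to the URL as-is, without percent-encoding.
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q=' + satz
r = requests.get(url)
# The result of this expression is discarded, so r.text is unchanged.
r.text.encode().decode('utf8', 'ignore')
n = json.loads(r.text)
# n[0] is a list of entries; the first element of each is printed as the translation.
i = 0
while i < len(n[0]):
    newLine = n[0][i][0]
    print(newLine)
    i = i + 1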

This is how my result looks:

Unter dem Mondschein glÃ¤nzt ein winziges Silberfragment, ein Bruchteil einer Li
nie â ? |

You do not need to encode then decode again. The default is UTF-8 **anyway**. — Martijn Pieters, May 25 '18 at 20:48
That line can be dropped entirely, because you ignored the return value. The `r.text` value is not affected. And you should just use `r.json()` anyway. — Martijn Pieters, May 25 '18 at 20:49
I can indeed reproduce the problem, but it is one on the side of Google, not your Python code. You are being served a Mojibake. — Martijn Pieters, May 25 '18 at 20:52
@MartijnPieters So there is no way to solve this Mojibake? I mean, if I visit the link https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q=Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…" in my browser, I get back a text file without Mojibake. — keschra, May 27 '18 at 11:21
Ah, we may be on to something, as your browser doesn't just send that URL (with the spaces in it); it also takes care of *encoding your text to URL quoted data*. I'll take a further stab. — Martijn Pieters, May 27 '18 at 14:13
Hrm, there must be some additional request header inspection that makes Google decide which encoding to use when producing the result, because using `params={'q': satz}` and removing the `&q=` part of the `url` variable still produces the same result. — Martijn Pieters, May 27 '18 at 14:16

score 1 · Answer 1 · answered May 27 '18 at 14:31

Google has served you Mojibake; the JSON response contains data that was originally encoded as UTF-8 but was then decoded with a different codec, resulting in incorrect text.

I suspect Google does this when it decodes the URL parameters. In the past, URL parameters could be encoded in any number of codecs; UTF-8 becoming the standard is a relatively recent development. This is Google's fault, not yours or that of requests.
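The effect is easy to reproduce locally. The sketch below assumes the wrong codec is Latin-1, purely for illustration; which codec is actually involved on Google's side is not known. It shows how the `ä` in "glänzt" turns into the `Ã¤` seen in the question's output, and that the damage can be undone as long as no bytes were lost along the way.

```python
# Reproduce the Mojibake locally.
# Assumption: the wrong codec is Latin-1; the codec used by Google is unknown.
original = "glänzt"

# Encode as UTF-8, then decode those same bytes with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # glÃ¤nzt

# Latin-1 maps every byte to exactly one character, so the step is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # glänzt
```

Reversing the damage like this is only a workaround; getting the server to return correct UTF-8 in the first place is the better fix.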

I found that setting a User-Agent header makes Google behave better; even an (incomplete) user agent of Mozilla/5.0 is enough here for Google to use UTF-8 when decoding your URL parameters.

You should also make sure your URL string is properly percent-encoded. If you pass the parameters in a dictionary to params, then requests will take care of adding them to the URL properly:

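import requests

satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&dt=t'
# Passing the query as a dictionary lets requests percent-encode it.
params = {
    'q': satz,
    'sl': 'en',
    'tl': 'de',
}
# Even an incomplete User-Agent was enough for Google to decode the query as UTF-8.
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, params=params, headers=headers)
results = r.json()[0]
# Each entry starts with the translated text, followed by the source text,
# matching the n[0][i][0] lookup in the question.
for outputline, inputline, *__ in results:
    print(outputline)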

Note that I pulled out the source and target language parameters into the params dictionary too, and pulled out the input and output line values from the results lists.
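To see what percent-encoding does to the query text, here is a small standard-library sketch. It uses `urllib.parse` rather than requests, so it only approximates what happens when a dictionary is passed to `params`; the shortened sentence is just for illustration.

```python
from urllib.parse import quote, urlencode

# Shortened sample sentence, for illustration only.
satz = "a fraction of a line…"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
print(quote("ä"))  # %C3%A4
print(quote("…"))  # %E2%80%A6

# urlencode() builds a complete query string; spaces become "+".
query = urlencode({"q": satz, "sl": "en", "tl": "de"})
print(query)       # q=a+fraction+of+a+line%E2%80%A6&sl=en&tl=de
```

The original script skipped this step by concatenating the raw sentence onto the URL, leaving the spaces and the `…` character to be encoded by whatever handles the URL next.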
