I have a problem with a Python script. I am trying to translate some sentences with the Google Translate API. Some sentences come back with broken non-ASCII characters, for example in place of ä, ö or ü. I can't work out why some sentences work and others don't.

If I make the API call directly in the browser, it works, but from inside my Python script I get garbled text.

This is a small version of my script that reproduces the error:

# -*- coding: utf-8 -*-
import requests
import json

satz="Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q='+satz
r = requests.get(url);
r.text.encode().decode('utf8','ignore')
n = json.loads(r.text);
i = 0
while i < len(n[0]):
    newLine = n[0][i][0]
    print(newLine)
    i=i+1

This is what my result looks like:

Unter dem Mondschein glänzt ein winziges Silberfragment, ein Bruchteil einer Li
nie â ? |
Martijn Pieters
keschra
  • You do not need to encode then decode again. The default is UTF-8 **anyway**. – Martijn Pieters May 25 '18 at 20:48
  • That line can be dropped entirely, because you ignored the return value. The `r.text` value is not affected. And you should just use `r.json()` anyway. – Martijn Pieters May 25 '18 at 20:49
  • I can indeed reproduce the problem, but it is one on the side of Google, not your Python code. You are being served a Mojibake. – Martijn Pieters May 25 '18 at 20:52
  • @MartijnPieters so is there no way to solve this Mojibake? I mean, if I visit the link https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q=Beneath the moonlight glints a tiny fragment of silver, a fraction of a line… in my browser, I get back a text file without Mojibake – keschra May 27 '18 at 11:21
  • Ah, we may be on to something, as your browser doesn't just send that URL (with the spaces in it); it also takes care of *encoding your text to URL quoted data*. I'll take a further stab. – Martijn Pieters May 27 '18 at 14:13
  • Hrm, there must be an additional request header inspection that makes Google decide on what encoding to use when producing the result, because using `params={'q': satz}` and removing the `&q=` part of the `url` variable still produces the same. – Martijn Pieters May 27 '18 at 14:16
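
To illustrate the first two comments, here is a minimal sketch of why the encode/decode line has no effect. The `text` value is a made-up stand-in for `r.text`, not a real API response.

```python
import json

# Hypothetical stand-in for r.text: requests already returns a decoded str.
text = '[["glänzt", "glints"]]'

# The round trip builds a new, equal string. As a bare statement its
# result is thrown away, so `text` itself is left untouched.
text.encode().decode('utf8', 'ignore')

# Capturing the result shows it is identical to the input.
roundtrip = text.encode().decode('utf8', 'ignore')
print(roundtrip == text)

# Parsing the text directly is all that is needed (r.json() does this step).
data = json.loads(text)
print(data[0][0])
```

Since the round trip changes nothing, it can neither cause nor repair the garbled characters.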

1 Answer

Google has served you a Mojibake: the JSON response contains data that was originally encoded as UTF-8 but was then decoded with a different codec, resulting in incorrect data.

I suspect Google does this when it decodes the URL parameters. In the past, URL parameters could be encoded with any number of codecs; UTF-8 becoming the standard is a relatively recent development. This is Google's fault, not yours or that of requests.
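
As a rough illustration of that kind of mismatch, the sketch below percent-encodes a string as UTF-8 and then decodes it twice, once with the right codec and once with the wrong one. The choice of `cp1252` is only an assumption for the demonstration; I don't know which codec Google actually falls back to.

```python
from urllib.parse import quote, unquote

text = "a line…"

# The client percent-encodes the UTF-8 bytes of the text.
quoted = quote(text)

# Decoding with the matching codec restores the original text.
good = unquote(quoted, encoding="utf-8")

# Decoding the same bytes as cp1252 (an assumed wrong codec) gives Mojibake:
# the three UTF-8 bytes of the ellipsis turn into three separate characters.
bad = unquote(quoted, encoding="cp1252")

print(quoted)
print(good)
print(bad)
```

The garbled tail in `bad` has the same shape as the `â ? |` at the end of the output in the question, where the console could not render two of the three characters.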

I found that setting a User-Agent header makes Google behave better; even an (incomplete) user agent of Mozilla/5.0 is enough here for Google to use UTF-8 when decoding your URL parameters.

You should also make sure your URL string is properly percent-encoded. If you pass the parameters as a dictionary to params, requests will take care of encoding them and adding them to the URL properly:

import requests

satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&dt=t'
# requests percent-encodes these values and appends them to the URL
params = {
    'q': satz,
    'sl': 'en',
    'tl': 'de',
}
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, params=params, headers=headers)
results = r.json()[0]
# each entry starts with the translated text, followed by the source text
for outputline, inputline, *__ in results:
    print(outputline)

Note that I pulled out the source and target language parameters into the params dictionary too, and pulled out the input and output line values from the results lists.
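
If you want to see what ends up on the wire without contacting Google, here is a small sketch that builds (but never sends) an equivalent request. It uses the standard library's `urllib` instead of requests purely so it runs offline; the shortened `q` value is just for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = "https://translate.googleapis.com/translate_a/single"
params = {"client": "gtx", "dt": "t", "sl": "en", "tl": "de", "q": "a line…"}

# urlencode percent-encodes each value as UTF-8 (spaces become "+").
full_url = base + "?" + urlencode(params)

# Attach the User-Agent header; nothing is sent unless urlopen() is called.
req = Request(full_url, headers={"User-Agent": "Mozilla/5.0"})

print(req.full_url)
print(req.get_header("User-agent"))
```

The ellipsis shows up in the query string as `%E2%80%A6`, its three UTF-8 bytes, which is what the server has to decode as UTF-8 to get the original text back.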

Martijn Pieters