Google Translate API returns non UTF8 characters

Question

Resolved: see the end of this post for the solution.

Good evening.

I'm trying out the Google Translate v3 API.

I have run into a mysterious encoding issue.

I do this:

def translate_text_langueTarget(texteToTranslate, langueTarget):
     parent = client.location_path(project_id, location)
     langueOrigin = detect_language(texteToTranslate)
     if (langueOrigin == "en" and langueTarget == "en"):
         return(texteToTranslate)
     try:
         response = client.translate_text(
             parent=parent,
             contents=[texteToTranslate],
             mime_type='text/plain',
             source_language_code=langueOrigin,
             target_language_code=langueTarget)
         translatedTexte = str(response.translations)[19:-3]
     except:
         translatedTexte = "Sorry my friend, the translation is lost on the internet"

     print(response)
     print(type(response))
     print(response.translations)
     print(type(response.translations))
     return(translatedTexte)

I call it with:

stringToTrad = "prefer"
langTarget = "fr"
translateString = translate_text_langueTarget(stringToTrad, langTarget)

I expect to get "préférer" as the answer.

But I get: "pr\303\251f\303\251rer"

I tried to track down this error by adding some debug output to my code:

print(response)
print(type(response))
print(response.translations)
print(type(response.translations))

I think it's an encoding problem, but I can't find an answer.

I work in Python, and my script starts with:

#! /usr/bin/env python3
# coding: utf-8

in the header.

Do you have any idea?

Resolved. I used:

translatedTexte = codecs.escape_decode(translatedTexte)[0]
translatedTexte = translatedTexte.decode("utf8")
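To make those two lines easier to follow, here is a minimal sketch that reproduces them on a literal copy of the escaped text from the question. Note that `codecs.escape_decode` is an undocumented CPython helper, so the sketch also shows a second route that uses only documented codecs. The variable names are illustrative and not part of the original script.

```python
import codecs

# The escaped text from the question: real backslashes followed by octal digits.
escaped = r"pr\303\251f\303\251rer"

# Route 1: the approach from the post. escape_decode turns the octal escapes
# back into raw bytes, which are then decoded as UTF-8.
raw = codecs.escape_decode(escaped.encode("ascii"))[0]
via_escape_decode = raw.decode("utf-8")

# Route 2: documented codecs only. "unicode_escape" maps each octal escape to
# the code point with that value, and the Latin-1 round trip turns those code
# points back into bytes. This assumes every non-ASCII character in the input
# is escaped, as it is here.
via_documented = (
    escaped.encode("latin-1")
    .decode("unicode_escape")
    .encode("latin-1")
    .decode("utf-8")
)

print(via_escape_decode, via_documented)
```

Both routes only undo the escaping after the fact; they do not explain why the text was escaped in the first place.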

Rub · Answer 1 · 2021-08-15T12:35:52.940

1

Apparently, the response from the API is HTML-encoded: it is UTF-8 text in which some characters, such as the apostrophe, are replaced by HTML entities like `&#39;`.

The solution is simple.

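import html

# Example string as returned by the API, with an HTML-escaped apostrophe.
sf = "Vinken rejoindra le conseil d&#39;administration en novembre."

print(sf)
# Vinken rejoindra le conseil d&#39;administration en novembre.

print(html.unescape(sf))
# Vinken rejoindra le conseil d'administration en novembre.

More info: https://stackoverflow.com/a/48805931/4752223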

edited Aug 15 '21 at 12:35

answered Aug 15 '21 at 12:30

Rub


Thanks a lot. I resolved my problem with a "magic trick", but you have the perfect answer. Thanks a lot. – Julien JAUFFRET Aug 18 '21 at 09:10

score 0 · Accepted Answer · answered Jul 24 '20 at 13:09


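The Google Translate API gives you UTF-8 text. You got the bytes c3 a9 (303 251 in octal), which is the UTF-8 encoding of é, as expected.

So your code takes correct UTF-8 data and writes it out, possibly with the wrong encoding.

This line does not help here. It only declares the encoding of the source file, which is already UTF-8 by default in Python 3, and it has no effect on input or output: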
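As a quick check of that claim, the following minimal sketch rebuilds the two bytes from their octal values and decodes them as UTF-8. The variable names are illustrative only.

```python
# The two octal escapes from the question, written as integer byte values.
raw = bytes([0o303, 0o251])

# The same two bytes in hexadecimal.
hex_form = raw.hex()

# Decoding the pair as UTF-8 yields a single character.
char = raw.decode("utf-8")

print(hex_form, char)
```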

# coding: utf-8

If you want your code to interpret input and output as UTF-8, you should say so explicitly. With your code, I assume that one problem is that you use print (it is better to write to a file). On Windows, terminals are not UTF-8 by default; they use an older "Windows ANSI" style code page such as Windows-1252.

So write to a file (with an explicit UTF-8 encoding), or change your terminal settings to get a UTF-8 terminal. In addition, you may have escape sequences in the results. To me, results written in octal look very suspicious. That is not something standard Python does (it would complain about a wrong encoding instead). You may need to parse the response in order to translate the escape sequences.
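Writing to a file with an explicit encoding could look like the following minimal sketch. The file name and the sample text are assumptions made for illustration; the file is placed in the system's temporary directory so the example can run anywhere.

```python
import os
import tempfile

# Sample text standing in for a translation result (assumption for the demo).
text = "préférer"

# Hypothetical output path in the temporary directory.
path = os.path.join(tempfile.gettempdir(), "translation_demo.txt")

# Name the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back in binary mode to inspect the bytes that were written.
with open(path, "rb") as f:
    data = f.read()

print(data)
```

Reading the file back in binary mode makes it easy to confirm that each é was stored as the byte pair c3 a9, independently of how a terminal would display it.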

answered Jul 24 '20 at 13:09

Giacomo Catenazzi


First, thanks for your help. I simplified my problem when I asked this question. I don't want to print it in a terminal; that was just for debugging. I work in a Flask API. I am trying to make an API that parses JSON comments { "id=1694639": "Je préfère le bleu que le rouge" }, translates them into English for processing with the pattern library, and then translates them back into their original language. Finally, I send a response. I get this in Postman when I test my API: "pr\\303\\251f\\303\\251rer": 1, "pr\\303\\251f\\303\\251rer_id": [ "1694639" ] How do I output them in UTF-8? – Julien JAUFFRET Jul 24 '20 at 13:24
Have you looked at the API documentation? You are using the raw output without parsing it. The magic `[19:-3]` smells. Please parse the output properly, and you will get your proper string. – Giacomo Catenazzi Jul 24 '20 at 13:46
I resolved my problem with this: translatedTexte = codecs.escape_decode(translatedTexte)[0] translatedTexte = translatedTexte.decode("utf8") I searched for this for 4 hours and never saw the codecs side of my problem. I looked at it thanks to your "interpret input and output as UTF-8, you should explicitly say so". Thanks a lot. – Julien JAUFFRET Jul 24 '20 at 13:51
"# coding: utf-8" *is* useful. It declares the encoding of the *Python source file* but has nothing to do with the encoding of files, sockets, databases, etc. – Mark Tolonen Jul 24 '20 at 17:59
@MarkTolonen: it is the default, and so it is not used if there are strong indications that UTF-8 is not the right encoding. So skipping such a line could only improve things (e.g. when copying and pasting code). The problem with that line: too many people here suggest it as a solution to encoding problems (which it is not), so that line is also a cause of problems. – Giacomo Catenazzi Jul 25 '20 at 05:03
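Regarding the comment about the magic `[19:-3]`: a minimal sketch of reading the translated string from the response instead of slicing its printed form might look like this. It assumes that each entry of `response.translations` exposes a `translated_text` attribute, as in the google-cloud-translate v3 client; the `SimpleNamespace` objects and the `first_translation` helper are hypothetical stand-ins so the example runs without credentials.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the object returned by client.translate_text(...).
response = SimpleNamespace(
    translations=[SimpleNamespace(translated_text="préférer")]
)


def first_translation(response):
    """Return the first translated string, or None if there is none."""
    translations = list(response.translations)
    return translations[0].translated_text if translations else None


print(first_translation(response))
```

Read this way, the value is an ordinary Python string, so no escape decoding should be needed afterwards.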
