
Resolved. See the end of this post for the solution.

Good evening.

I'm trying to play with the Google Translate v3 API.

And I've run into a mysterious encoding issue.

I do this:

def translate_text_langueTarget(texteToTranslate, langueTarget):
    parent = client.location_path(project_id, location)
    langueOrigin = detect_language(texteToTranslate)
    # Nothing to translate when source and target are both English.
    if langueOrigin == "en" and langueTarget == "en":
        return texteToTranslate
    try:
        response = client.translate_text(
            parent=parent,
            contents=[texteToTranslate],
            mime_type='text/plain',
            source_language_code=langueOrigin,
            target_language_code=langueTarget)
        # Debug output, kept inside the try so that response is always defined here.
        print(response)
        print(type(response))
        print(response.translations)
        print(type(response.translations))
        # Slices the text out of the printed representation of the response.
        translatedTexte = str(response.translations)[19:-3]
    except Exception:
        translatedTexte = "Sorry my friend, the translation is lost on the internet"

    return translatedTexte

I call it with:

stringToTrad = "prefer"
langTarget = "fr"
translateString = translate_text_langueTarget(stringToTrad, langTarget)

And I expect to get "préférer" as the answer.

But I get: "pr\303\251f\303\251rer"

I have tried to track down this error with a bit of debug output in my code:

print(response)
print(type(response))
print(response.translations)
print(type(response.translations))

I think it's an encoding problem, but I can't find an answer to it.

I work in Python, and my script starts with this header:

#! /usr/bin/env python3
# coding: utf-8

Do you have any idea?

Resolved. I use:

translatedTexte = codecs.escape_decode(translatedTexte)[0]
translatedTexte = translatedTexte.decode("utf8")
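
For reference, here is a self-contained sketch of what these two lines do, with the string from the question written as a literal. Note that `codecs.escape_decode` is an undocumented CPython helper, so this is a workaround under that assumption rather than a supported API. The function name `decode_octal_escapes` is made up for illustration.

```python
import codecs


def decode_octal_escapes(text: str) -> str:
    """Turn literal backslash escapes such as \\303\\251 back into real characters."""
    # escape_decode interprets the escapes and returns a (bytes, length) tuple.
    raw_bytes = codecs.escape_decode(text.encode("utf-8"))[0]
    # The resulting bytes are the UTF-8 encoding of the translated text.
    return raw_bytes.decode("utf-8")


# Doubled backslashes: the string contains real backslash characters,
# exactly as they appeared in the printed response.
escaped = "pr\\303\\251f\\303\\251rer"
print(decode_octal_escapes(escaped))
```

This only undoes the escaping introduced by converting the response object to a string; it does not change anything about the API call itself.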
Rub

2 Answers

1

Apparently, the response from the API is HTML-encoded: the text itself is UTF-8, but some characters (such as the apostrophe) come back escaped as HTML entities. This is not the same thing as URL encoding.

The solution is simple.

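import html

# Example input containing an HTML entity for the apostrophe.
sf = "Vinken rejoindra le conseil d&#39;administration en novembre."

print(sf)
# Vinken rejoindra le conseil d&#39;administration en novembre.

print(html.unescape(sf))
# Vinken rejoindra le conseil d'administration en novembre.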

+Info https://stackoverflow.com/a/48805931/4752223

Rub
0

The Google Translate API gives you UTF-8 text. You got the bytes c3 a9 (303 251 in octal), which really is é, as expected.
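
The arithmetic can be checked in a few lines of plain Python. This is only a sketch to confirm the byte values; `octal_bytes` is a name chosen for illustration.

```python
# Octal 303 and 251 are the byte values 0xC3 and 0xA9.
octal_bytes = bytes([0o303, 0o251])

print(octal_bytes.hex())                   # c3a9
print(octal_bytes.decode("utf-8"))         # é
print("é".encode("utf-8") == octal_bytes)  # True
```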

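So your code takes correct UTF-8 data and probably writes it out with the wrong encoding.

This line is a bit of a myth and does not help here, since it only declares the encoding of the source file (already UTF-8 by default in Python 3):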

# coding: utf-8

If you want your code to interpret input and output as UTF-8, you should say so explicitly. With your code, I assume that one problem is that you use print (it is better to write into a file). On Windows, terminals are not UTF-8 by default; they use a legacy code page such as Windows-1252 (often called "ANSI").
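
As a minimal sketch of "saying so explicitly", the following writes a string to a file with a stated encoding and reads it back. The file name and the temporary directory are assumptions made for the demo only.

```python
import os
import tempfile

text = "préférer"

# Hypothetical output path inside a temporary directory, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "translation.txt")

# State the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Read the raw bytes to see what actually landed on disk.
with open(path, "rb") as handle:
    raw = handle.read()

# Read it back as text, again with an explicit encoding.
with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()

print(raw)
print(round_trip)
```

For terminal output, `sys.stdout.reconfigure(encoding="utf-8")` (available since Python 3.7) is one option, provided the terminal itself can display UTF-8.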

So write into a file (with explicit UTF-8 encoding), or change the terminal settings to get a UTF-8 terminal. In addition, you may have escape sequences in the results. To me, results written in octal look suspicious: that is not something standard Python does (and it would complain about a wrong encoding). You may need to parse the response to translate the escape sequences.
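
One way to parse the response instead of slicing its printed form is to read the field directly. The sketch below assumes the response follows the v3 `TranslateTextResponse` shape, where each item of `translations` has a `translated_text` field. `FakeTranslation`, `FakeResponse` and `first_translation` are hypothetical stand-ins so the example runs without credentials; with the real client, the object would come from `client.translate_text(...)`.

```python
from dataclasses import dataclass, field
from typing import List


# Stand-ins that only mimic the shape of the API response (assumption).
@dataclass
class FakeTranslation:
    translated_text: str


@dataclass
class FakeResponse:
    translations: List[FakeTranslation] = field(default_factory=list)


def first_translation(response) -> str:
    """Return the first translated string, or an empty string if there is none."""
    if not response.translations:
        return ""
    # Attribute access yields a normal Python str, with no escape sequences.
    return response.translations[0].translated_text


response = FakeResponse([FakeTranslation("préférer")])
print(first_translation(response))
```

Reading the attribute avoids both the `[19:-3]` slice and the octal escapes, since those only appear when the whole response object is converted to a string.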

Giacomo Catenazzi
  • First, thanks for your help. I simplified my problem when I asked this question. I don't want to print it in a terminal; that was just for debugging. I work on a Flask API. I'm trying to build an API that parses JSON comments, e.g. { "id=1694639": "Je préfère le bleu que le rouge" }, translates them into English for processing with the pattern library, and then translates them back into their original language. Finally I send a response. This is what I get in Postman when I test my API: "pr\\303\\251f\\303\\251rer": 1, "pr\\303\\251f\\303\\251rer_id": [ "1694639" ] How can I output them in UTF-8? – Julien JAUFFRET Jul 24 '20 at 13:24
  • Have you looked at the API documentation? You use the raw output without parsing it. The magic `[19:-3]` smells. Please parse the output properly, and you will get your proper string. – Giacomo Catenazzi Jul 24 '20 at 13:46
  • I resolved my problem with this: translatedTexte = codecs.escape_decode(translatedTexte)[0] translatedTexte = translatedTexte.decode("utf8") I searched for this for 4 hours and never saw the codecs side of my problem. I looked into it thanks to your "interpret input and output as UTF-8, you should explicitly say so". Thanks a lot. – Julien JAUFFRET Jul 24 '20 at 13:51
  • "# coding: utf-8" *is* useful. It declares the encoding of the *Python source file* but has nothing to do with the encoding of files, sockets, databases, etc. – Mark Tolonen Jul 24 '20 at 17:59
  • @MarkTolonen: it is the default, and so it is not used if there are strong suggestions that UTF-8 is not the right encoding. So skipping such a line could only improve things (e.g. when copying/pasting code). The problem with that line: too many people here suggest it as a solution to encoding problems (which it is not), so that line is also a cause of problems. – Giacomo Catenazzi Jul 25 '20 at 05:03