
In my application, I am translating a few Spanish texts to English using the Google Cloud Translation API. I noticed that the translated text comes back HTML-escaped: "'" appears as "&#39;", ">" appears as "&gt;", and so on.

I also checked the translation by calling the REST API with curl, and it returns the same escaped result:

curl --request GET "https://translation.googleapis.com/language/translate/v2?key=$GOOGLE_API_KEY&q=Es%20un%20brillante%20d%C3%ADa%20soleado&target=en"

The response to this request is:

    {
      "data": {
        "translations": [
          {
            "translatedText": "It&#39;s a bright sunny day",
            "detectedSourceLanguage": "es"
          }
        ]
      }
    }

When I translate the same Spanish text to English in the Google Translate web interface (https://translate.google.com), I get the English text "It's a bright sunny day".

My first question: Is this escaping intentional, or is it a bug?

To unescape the text, I am using org.apache.commons.text.StringEscapeUtils.unescapeHtml4():

StringEscapeUtils.unescapeHtml4(translation.getTranslatedText());

My second question: Is this the right way of unescaping the translated text?
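To make the unescaping step concrete, here is a minimal standard-library sketch of what such a decoder does. The class name EntityDecoder is hypothetical, and it only handles numeric character references plus four named entities (amp, lt, gt, quot); it is an illustration, not a replacement for unescapeHtml4(), which covers the full HTML 4 entity set.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class EntityDecoder {
    // Matches decimal (&#39;), hexadecimal (&#x27;) and named (&gt;) references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z]+);");

    // Deliberately small table; anything else is left untouched.
    private static final Map<String, String> NAMED =
        Map.of("amp", "&", "lt", "<", "gt", ">", "quot", "\"");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Single pass: decoded output is never scanned again.
            String replacement = resolve(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String resolve(String body, String original) {
        if (body.charAt(0) != '#') {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
        } catch (NumberFormatException e) {
            return original; // number too large to parse: keep the text as-is
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("It&#39;s a bright sunny day"));
    }
}
```

Because the input is decoded in a single pass, a string such as "&amp;#39;" becomes "&#39;" and is not decoded a second time.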

ronojoy ghosh

2 Answers


This question is similar to "Google Translate API outputs HTML entities".

Since the request does not specify a format, the Google Translation API falls back to the default format, which is HTML. That is why it returns an HTML-encoded string as the translated text. If the format is explicitly set to "text", no HTML encoding is applied.

The curl request for the translation then becomes:

curl --request GET "https://translation.googleapis.com/language/translate/v2?key=$GOOGLE_API_KEY&q=Es%20un%20brillante%20d%C3%ADa%20soleado&target=en&format=text"

The response is:

    {
      "data": {
        "translations": [
          {
            "translatedText": "It's a bright sunny day",
            "detectedSourceLanguage": "es"
          }
        ]
      }
    }

Therefore, unescaping HTML is not required here, since the encoding can be avoided in the first place.
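If the request URL is assembled in Java rather than by hand, the same idea can be sketched with the standard library alone. The class and method names below are hypothetical; only the endpoint and the key, q, target and format parameters come from the curl command above. Note that URLEncoder encodes spaces as "+" rather than "%20", which is the form-encoding convention for query strings.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class TranslateUrlSketch {
    private static final String ENDPOINT =
        "https://translation.googleapis.com/language/translate/v2";

    // Builds a v2 translate URL that explicitly asks for plain-text output.
    static String buildUrl(String apiKey, String text, String target) {
        return ENDPOINT
            + "?key=" + encode(apiKey)
            + "&q=" + encode(text)
            + "&target=" + encode(target)
            + "&format=text"; // without this, the API defaults to HTML
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumes the key is in GOOGLE_API_KEY; falls back to a placeholder.
        String key = System.getenv().getOrDefault("GOOGLE_API_KEY", "YOUR_API_KEY");
        System.out.println(buildUrl(key, "Es un brillante d\u00EDa soleado", "en"));
    }
}
```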

ronojoy ghosh

By default the API treats the text as HTML and escapes the output, but setting the format parameter to text returns characters such as ' and " as-is instead of &#39; and &quot;.

In Java, you can set the format through the format option:

public static Translate.TranslateOption format(String format)

Sets the format of the source text, in either HTML (default) or plain-text. A value of html indicates HTML and a value of text indicates plain-text.

For example, the quotes in this request are escaped in the default response:

$ curl -s -X POST -H "Content-Type: application/json"     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token)     --data "{
  'q': 'The three \'pyramids\' in the Giza pyramid complex.',
  'source': 'en',
  'target': 'es' 
}" "https://translation.googleapis.com/language/translate/v2"
{
  "data": {
    "translations": [
      {
        "translatedText": "Las tres &#39;pirámides&#39; en el complejo piramidal de Giza."
      }
    ]
  }
}

With format set to text, the quotes are returned unescaped:

curl -s -X POST -H "Content-Type: application/json"     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token)     --data "{
  'q': 'The three \'pyramids\' in the Giza pyramid complex.',
  'source': 'en',
  'target': 'es',
  'format': 'text'
}" "https://translation.googleapis.com/language/translate/v2"
{
  "data": {
    "translations": [
      {
        "translatedText": "Las tres 'pirámides' en el complejo piramidal de Giza."
      }
    ]
  }
}
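For completeness, a rough Java sketch of the same POST request using java.net.http (Java 11+) is shown here. The class and method names are hypothetical, the access token is assumed to be obtained elsewhere (for example from the gcloud command above), and the request is only built, not sent. The JSON body is assembled by hand to keep the example dependency-free; a real project would more likely use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, for illustration only.
final class TranslateRequestSketch {
    private static final String ENDPOINT =
        "https://translation.googleapis.com/language/translate/v2";

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Same fields as the curl example, with format fixed to "text".
    static String body(String text, String source, String target) {
        return "{\"q\":" + jsonString(text)
            + ",\"source\":" + jsonString(source)
            + ",\"target\":" + jsonString(target)
            + ",\"format\":\"text\"}";
    }

    static HttpRequest build(String accessToken, String text, String source, String target) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
            .header("Content-Type", "application/json")
            .header("Authorization", "Bearer " + accessToken)
            .POST(HttpRequest.BodyPublishers.ofString(body(text, source, target)))
            .build();
    }

    public static void main(String[] args) {
        System.out.println(body("The three 'pyramids' in the Giza pyramid complex.", "en", "es"));
    }
}
```

The resulting request could then be passed to HttpClient.send, and translatedText in the response should need no unescaping.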
rsantiago