In my application, I am translating a few Spanish texts to English with the Google Cloud Translation API. The translated text I get back is HTML-escaped: "'" appears as "&#39;", ">" appears as "&gt;", and so on.
I also checked the translation through the REST API with curl, and it returns the same escaped result -
curl --request GET "https://translation.googleapis.com/language/translate/v2?key=$GOOGLE_API_KEY&q=Es%20un%20brillante%20d%C3%ADa%20soleado&target=en"
The response to this curl is -
{
  "data": {
    "translations": [
      {
        "translatedText": "It&#39;s a bright sunny day",
        "detectedSourceLanguage": "es"
      }
    ]
  }
}
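For reference, the percent-encoded q value in the request can be reproduced in Java with the standard library alone. This is only a sketch: the class and method names are made up for illustration, and it covers just the encoding step, not the HTTP call.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any Google client library.
final class QueryEncodingSketch {

    // URLEncoder produces form encoding, where a space becomes "+".
    // Replacing "+" with "%20" is safe because a literal "+" in the
    // input has already been encoded as "%2B" at that point.
    static String encodeQueryValue(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        // \u00ed is the accented "i" in the Spanish word for "day".
        System.out.println(encodeQueryValue("Es un brillante d\u00eda soleado"));
        // prints Es%20un%20brillante%20d%C3%ADa%20soleado
    }
}
```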
When I translate the same Spanish text to English on the Google Translate website (https://translate.google.com), I get the unescaped English text "It's a bright sunny day".
My first question: Is the API response escaped like this on purpose, or is this a bug?
To unescape the text, I am using org.apache.commons.text.StringEscapeUtils.unescapeHtml4() -
StringEscapeUtils.unescapeHtml4(translation.getTranslatedText());
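To make clear what I expect the unescaping step to do, here is a minimal standard-library sketch. It is not the Commons Text implementation and not equivalent to unescapeHtml4(): the class name, method name, and the small table of named entities are assumptions for illustration. It decodes decimal and hexadecimal numeric entities plus four named ones in a single pass.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; a real application would use a full entity table.
final class EntityUnescapeSketch {

    // Assumed minimal set of named entities for this sketch.
    private static final Map<String, String> NAMED =
            Map.of("amp", "&", "lt", "<", "gt", ">", "quot", "\"");

    // Matches &#39; style decimal, &#x27; style hex, and &name; entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z]+);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            try {
                if (body.startsWith("#x") || body.startsWith("#X")) {
                    int codePoint = Integer.parseInt(body.substring(2), 16);
                    replacement = new String(Character.toChars(codePoint));
                } else if (body.startsWith("#")) {
                    int codePoint = Integer.parseInt(body.substring(1));
                    replacement = new String(Character.toChars(codePoint));
                } else {
                    // Unknown names are left exactly as they appeared.
                    replacement = NAMED.getOrDefault(body, m.group());
                }
            } catch (IllegalArgumentException e) {
                // Out-of-range or invalid code point: keep the original text.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("It&#39;s a bright sunny day"));
        // prints It's a bright sunny day
    }
}
```

Because the input is scanned only once, a double-escaped value such as "&amp;#39;" comes out as "&#39;" rather than "'", which is the behavior I would want from any unescaping step.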
My second question: Is this the right way to unescape the translated text?