
I am trying to use the Python client on Python 3 (Colab) to analyze text with accented characters. I am setting up the document object with type PLAIN_TEXT.

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

nlp_def_language = 'es'  # assumed default; the sample text below is Spanish

# Run entity and syntax analysis on the text
def nlp_analyze_text(text, lang=nlp_def_language):
  client = language.LanguageServiceClient()

  document = types.Document(
      content=text,
      language=lang,
      type=enums.Document.Type.PLAIN_TEXT)
  entities = client.analyze_entities(document=document, encoding_type='UTF32')
  syntax = client.analyze_syntax(document=document)

  return (entities, syntax)

So the input that is fed into the client contains multibyte characters.

text = u"Mi vieja mula ya no es lo que era? Qué era entonces? Era de Bs.As. Saludos!"
nlp_analyze_text(text)

I believe this is not being interpreted properly by Google Cloud NL:

sentences {
   text {
     content: "Qu\303\251 era entonces?"
     begin_offset: -1
   }
 }
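For reference, those octal escapes correspond to the UTF-8 bytes of "é"; a quick check in plain Python 3 (a sketch, independent of the client) shows this:

    # 0o303 = 0xC3 and 0o251 = 0xA9, i.e. the UTF-8 encoding of "é"
    print(b'\303\251'.decode('utf-8'))   # -> é
    print(u'é'.encode('utf-8'))          # -> b'\xc3\xa9'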

So, how should I set up the code to analyze text with accented characters?

Thanks

Gabriel

1 Answer


It turns out I was only seeing escaped characters because of how the printed object's str implementation rendered them. When I printed the deeper attributes, I saw the string unescaped.
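For example, a minimal sketch reusing nlp_analyze_text from the question (sentences, text, and content are the standard fields of the syntax response):

    entities, syntax = nlp_analyze_text(u"Qué era entonces?")

    print(syntax)                      # the repr shows "Qu\303\251 era entonces?"
    for sentence in syntax.sentences:
        print(sentence.text.content)   # prints the unescaped string: Qué era entonces?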

Hope this post helps others.

Gabriel