I'm using the Google Cloud language_v1 API to extract named entities from some input text, but something odd seems to be going on with the encoding_type parameter. When I run

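# Imports assumed for the pre-2.0 google-cloud-language client
from google.cloud import language
from google.cloud.language import enums, types
from google.cloud.language_v1.enums import EncodingType
txt = '''La divinité des uji la plus importante était ( et est toujours ) Amaterasu , la déesse solaire . '''.strip()
client = language.LanguageServiceClient()
document = types.Document(content=txt, type=enums.Document.Type.PLAIN_TEXT, language='fr')
ents = client.analyze_entities(document, encoding_type=EncodingType.UTF8)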

ents correctly contains the entity 'Amaterasu', but the returned begin offset is 67 instead of 65. If I specify encoding_type=EncodingType.UTF16 instead, the offset is correct.

Note that Python source files are UTF-8 encoded by default, and that I get the same result if I store the text in a UTF-8 file and read it back with the right encoding. Any idea what is going on?
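
For reference, here is a minimal sketch of how the two numbers relate, using plain Python with no API call. It assumes that the offset is counted in units of the requested encoding, so with UTF8 it would be a byte offset rather than a character index. The text contains two 'é' characters before 'Amaterasu', each taking two bytes in UTF-8, which would account for the difference of 2. The helper utf8_to_char_offset is a hypothetical name introduced only for this illustration.

```python
txt = "La divinité des uji la plus importante était ( et est toujours ) Amaterasu , la déesse solaire ."
entity = "Amaterasu"

# Offset in Unicode code points: this is what Python's str indexing uses.
char_offset = txt.index(entity)

# Offset in UTF-8 bytes: every non-ASCII character before the entity
# (here the two 'é') takes more than one byte.
utf8_offset = len(txt[:char_offset].encode("utf-8"))

# Offset in UTF-16 code units (two bytes per unit, no byte-order mark).
utf16_offset = len(txt[:char_offset].encode("utf-16-le")) // 2


def utf8_to_char_offset(text, byte_offset):
    """Convert a UTF-8 byte offset into a Python str index (hypothetical helper)."""
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


print(char_offset, utf8_offset, utf16_offset)  # 65 67 65
print(utf8_to_char_offset(txt, utf8_offset))   # 65
```

If that assumption holds, a UTF8 offset has to be converted before slicing a Python str, as the helper does. The UTF16 offset only matches the str index here because every character in this text fits in a single UTF-16 code unit.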

Alberto
