I'm using the Google Cloud language_v1 API to extract named entities from some input text, but something odd seems to be going on with the encoding_type parameter. When I run
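# Imports, assuming the 1.x google-cloud-language Python client
from google.cloud import language_v1 as language
from google.cloud.language_v1 import enums, types
from google.cloud.language_v1.enums import EncodingType

txt = '''La divinité des uji la plus importante était ( et est toujours ) Amaterasu , la déesse solaire . '''.strip()
client = language.LanguageServiceClient()
document = types.Document(content=txt, type=enums.Document.Type.PLAIN_TEXT, language='fr')
# Request entity offsets computed against the UTF-8 encoding
ents = client.analyze_entities(document, encoding_type=EncodingType.UTF8)
ents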
it correctly detects the entity 'Amaterasu', but the returned begin offset is 67 instead of 65. If I specify encoding_type=EncodingType.UTF16 instead, the offset is correct.
Note that Python source files are UTF-8 encoded by default, and I get the same result if I store the text in a UTF-8 file and read it back with the right encoding. Any idea what is going on?
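For what it's worth, both numbers can be reproduced locally without calling the API. The sketch below only uses plain Python string and bytes operations on the same text. It makes me suspect that the UTF8 offset counts bytes rather than characters (the two 'é' before the entity take two bytes each in UTF-8), but that is my assumption, not something I have confirmed in the documentation.

```python
txt = 'La divinité des uji la plus importante était ( et est toujours ) Amaterasu , la déesse solaire .'
entity = 'Amaterasu'

# Offset counted in Python characters (Unicode code points).
char_offset = txt.index(entity)

# Offset counted in bytes of the UTF-8 encoding of the text before the entity.
utf8_offset = len(txt[:char_offset].encode('utf-8'))

# Offset counted in UTF-16 code units (2 bytes each) of the text before the entity.
utf16_offset = len(txt[:char_offset].encode('utf-16-le')) // 2

print(char_offset, utf8_offset, utf16_offset)  # 65 67 65
```

So 65 is the index I get from Python string indexing, and 67 matches the length in bytes of the UTF-8 encoded prefix.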