
I'm using Elasticsearch with the Python client and I have a question about the interaction between unicode, ES, analyzers, and emojis. When I try to run a unicode text string that contains an emoji character through the ES analyzer, it seems to screw up the term offsets in the resulting output.

For example:

>> es.indices.analyze(body=u'\U0001f64f testing')
{u'tokens': [{u'end_offset': 10,
   u'position': 1,
   u'start_offset': 3,
   u'token': u'testing',
   u'type': u'<ALPHANUM>'}]}

This gives me the wrong offsets for the term testing.

>> u'\U0001f64f testing'[3:10]
u'esting'

If I do it with another unicode foreign character (for example the yen symbol), I don't get the same error.

>> es.indices.analyze(body=u'\u00A5 testing')
{u'tokens': [{u'end_offset': 9,
   u'position': 1,
   u'start_offset': 2,
   u'token': u'testing',
   u'type': u'<ALPHANUM>'}]}

>> u'\u00A5 testing'[2:9]
u'testing'

Can anybody explain what is going on?

plam

2 Answers


I was facing the exact same issue and managed to map the offsets correctly by encoding back and forth in UTF-16:

TEXT = "🥕 carrot"
TOKENS = es.indices.analyze(body=TEXT)["tokens"]
# [
#   {
#     "token" : "🥕",
#     "start_offset" : 0,
#     "end_offset" : 2,
#     "type" : "<EMOJI>",
#     "position" : 0
#   },
#   {
#     "token" : "carrot",
#     "start_offset" : 3,
#     "end_offset" : 9,
#     "type" : "<ALPHANUM>",
#     "position" : 1
#   }
# ]

ENCODED_TEXT = TEXT.encode("utf-16")
# b'\xff\xfe>\xd8U\xdd \x00c\x00a\x00r\x00r\x00o\x00t\x00'
BOM_MARK_OFFSET = 2

def get_decoded_token(encoded_text, token):
    start_offset = (token["start_offset"] * 2) + BOM_MARK_OFFSET
    end_offset = (token["end_offset"] * 2) + BOM_MARK_OFFSET
    return encoded_text[start_offset:end_offset].decode("utf-16")

assert get_decoded_token(ENCODED_TEXT, TOKENS[0]) == "🥕"
assert get_decoded_token(ENCODED_TEXT, TOKENS[1]) == "carrot"

For BOM_MARK_OFFSET, see https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16
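For what it's worth, you can skip the BOM correction entirely by using the explicit `"utf-16-le"` codec, which never emits a BOM. A minimal sketch of the same idea (the token dicts here are hand-written stand-ins for analyzer output, not a live ES response):

```python
def get_token_text(text, token):
    # Encode without a BOM: every UTF-16 code unit is exactly 2 bytes,
    # so ES offsets map directly to byte offsets after doubling.
    data = text.encode("utf-16-le")
    start = token["start_offset"] * 2
    end = token["end_offset"] * 2
    return data[start:end].decode("utf-16-le")

text = "\U0001F955 carrot"  # 🥕 carrot
# U+1F955 is two UTF-16 code units, so "carrot" starts at offset 3.
assert get_token_text(text, {"start_offset": 3, "end_offset": 9}) == "carrot"
assert get_token_text(text, {"start_offset": 0, "end_offset": 2}) == "\U0001F955"
```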

rlat

Are you on Python 2, or on Python 3.2 or earlier? Before Python 3.3, CPython came in narrow and wide Unicode builds (Windows shipped narrow builds). Narrow builds store strings internally as UTF-16 code units of two bytes each, so a codepoint above U+FFFF takes two surrogate code units.

Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len('\U0001f64f')
1
>>> '\U0001f64f'[0]
'\U0001f64f'

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u'\U0001f64f')
2
>>> u'\U0001f64f'[0]
u'\ud83d'
>>> u'\U0001f64f'[1]
u'\ude4f'

In your case, however, the offsets are correct for a narrow build: because U+1F64F takes two UTF-16 surrogates, the offset of "t" is 3. I'm not sure how you got your output:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\U0001f64f testing'
>>> x
u'\U0001f64f testing'
>>> x[3:10]
u'testing'
>>> y = u'\u00a5 testing'
>>> y[2:9]
u'testing'
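On Python 3.3+, strings always index by code point, so ES's UTF-16-based offsets have to be adjusted by the number of surrogate pairs that precede them. One possible sketch (not the only way to do it):

```python
def utf16_to_codepoint_offset(text, utf16_offset):
    # Walk the string, counting UTF-16 code units, and return the
    # code-point index where the given UTF-16 offset falls.
    units = 0
    for i, ch in enumerate(text):
        if units == utf16_offset:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)

text = "\U0001F64F testing"
# The emoji is one code point but two UTF-16 code units, so ES reports
# "testing" at [3, 10); in a Python 3 str that slice is [2, 9).
start = utf16_to_codepoint_offset(text, 3)
end = utf16_to_codepoint_offset(text, 10)
assert text[start:end] == "testing"
```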
Mark Tolonen
  • On Python 2, there are narrow (your case, Windows) and wide CPython builds e.g., `u'\U0001f64f'[0] == u'\U0001f64f'` on Ubuntu. – jfs Sep 19 '15 at 23:02