
I'm using Elasticsearch with the Python client and I have a question about the interaction between unicode, ES, analyzers, and emojis. When I try to run a unicode text string that contains an emoji character through the ES analyzer, it seems to screw up the term offsets in the resulting output.

For example:

>> es.indices.analyze(body=u'\U0001f64f testing')
{u'tokens': [{u'end_offset': 10,
   u'position': 1,
   u'start_offset': 3,
   u'token': u'testing',
   u'type': u'<ALPHANUM>'}]}

This gives me the wrong offsets for the term testing.

>> u'\U0001f64f testing'[3:10]
u'esting'

If I do it with another unicode foreign character (for example the yen symbol), I don't get the same error.

>> es.indices.analyze(body=u'\u00A5 testing')
{u'tokens': [{u'end_offset': 9,
   u'position': 1,
   u'start_offset': 2,
   u'token': u'testing',
   u'type': u'<ALPHANUM>'}]}

>> u'\u00A5 testing'[2:9]
u'testing'

Can anybody explain what is going on?

plam

2 Answers


I was facing the exact same issue and managed to map the offsets correctly by encoding back and forth in UTF-16:

TEXT = "🥕 carrot"
TOKENS = es.indices.analyze(body=TEXT)["tokens"]
# [
#   {
#     "token" : "🥕",
#     "start_offset" : 0,
#     "end_offset" : 2,
#     "type" : "<EMOJI>",
#     "position" : 0
#   },
#   {
#     "token" : "carrot",
#     "start_offset" : 3,
#     "end_offset" : 9,
#     "type" : "<ALPHANUM>",
#     "position" : 1
#   }
# ]

ENCODED_TEXT = TEXT.encode("utf-16")
# b'\xff\xfe>\xd8U\xdd \x00c\x00a\x00r\x00r\x00o\x00t\x00'
BOM_MARK_OFFSET = 2

def get_decoded_token(encoded_text, token):
    start_offset = (token["start_offset"] * 2) + BOM_MARK_OFFSET
    end_offset = (token["end_offset"] * 2) + BOM_MARK_OFFSET
    return encoded_text[start_offset:end_offset].decode("utf-16")

assert get_decoded_token(ENCODED_TEXT, TOKENS[0]) == "🥕"
assert get_decoded_token(ENCODED_TEXT, TOKENS[1]) == "carrot"

For BOM_MARK_OFFSET, see https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16
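For what it's worth, you can skip the BOM correction entirely by using the explicit `"utf-16-le"` codec, which never emits a BOM. A minimal sketch of the same idea (the token dicts here are hand-written stand-ins for analyzer output, not a live ES response):

```python
def get_token_text(text, token):
    # Encode without a BOM: every UTF-16 code unit is exactly 2 bytes,
    # so ES offsets map directly to byte offsets after doubling.
    data = text.encode("utf-16-le")
    start = token["start_offset"] * 2
    end = token["end_offset"] * 2
    return data[start:end].decode("utf-16-le")

text = "\U0001F955 carrot"  # 🥕 carrot
# U+1F955 is two UTF-16 code units, so "carrot" starts at offset 3.
assert get_token_text(text, {"start_offset": 3, "end_offset": 9}) == "carrot"
assert get_token_text(text, {"start_offset": 0, "end_offset": 2}) == "\U0001F955"
```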

rlat

Are you on Python 2, or on Python 3.2 or earlier? Before Python 3.3, CPython came in narrow and wide Unicode builds (Windows shipped narrow builds). Narrow builds store strings internally as UTF-16 code units of two bytes each, so a codepoint above U+FFFF takes two surrogate code units.

Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len('\U0001f64f')
1
>>> '\U0001f64f'[0]
'\U0001f64f'

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u'\U0001f64f')
2
>>> u'\U0001f64f'[0]
u'\ud83d'
>>> u'\U0001f64f'[1]
u'\ude4f'

In your case, however, the offsets are correct for a narrow build: because U+1F64F takes two UTF-16 surrogates, the offset of "t" is 3. I'm not sure how you got your output:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\U0001f64f testing'
>>> x
u'\U0001f64f testing'
>>> x[3:10]
u'testing'
>>> y = u'\u00a5 testing'
>>> y[2:9]
u'testing'
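On Python 3.3+, strings always index by code point, so ES's UTF-16-based offsets have to be adjusted by the number of surrogate pairs that precede them. One possible sketch (not the only way to do it):

```python
def utf16_to_codepoint_offset(text, utf16_offset):
    # Walk the string, counting UTF-16 code units, and return the
    # code-point index where the given UTF-16 offset falls.
    units = 0
    for i, ch in enumerate(text):
        if units == utf16_offset:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)

text = "\U0001F64F testing"
# The emoji is one code point but two UTF-16 code units, so ES reports
# "testing" at [3, 10); in a Python 3 str that slice is [2, 9).
start = utf16_to_codepoint_offset(text, 3)
end = utf16_to_codepoint_offset(text, 10)
assert text[start:end] == "testing"
```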
Mark Tolonen
  • On Python 2, there are narrow (your case, Windows) and wide CPython builds e.g., `u'\U0001f64f'[0] == u'\U0001f64f'` on Ubuntu. – jfs Sep 19 '15 at 23:02