I'm using Elasticsearch with the Python client, and I have a question about the interaction between unicode, ES, analyzers, and emoji. When I run a unicode string containing an emoji character through the ES analyzer, the term offsets in the resulting output come back wrong.
For example:
>> es.indices.analyze(body=u'\U0001f64f testing')
{u'tokens': [{u'end_offset': 10,
              u'position': 1,
              u'start_offset': 3,
              u'token': u'testing',
              u'type': u'<ALPHANUM>'}]}
This gives me the wrong offsets for the term "testing":
>> u'\U0001f64f testing'[3:10]
u'esting'
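Interestingly, the offsets do line up if I treat them as UTF-16 code-unit offsets rather than code-point offsets (just a guess on my part, based on knowing that Java strings are indexed by UTF-16 code units):

>> s = u'\U0001f64f testing'
>> # each UTF-16 code unit is two bytes in utf-16-le, so the code-unit
>> # offsets 3 and 10 become byte offsets 6 and 20
>> s.encode('utf-16-le')[3 * 2:10 * 2].decode('utf-16-le')
u'testing'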
If I do the same thing with a different non-ASCII character (for example the yen sign), the offsets come back correct:
>> es.indices.analyze(body=u'\u00A5 testing')
{u'tokens': [{u'end_offset': 9,
              u'position': 1,
              u'start_offset': 2,
              u'token': u'testing',
              u'type': u'<ALPHANUM>'}]}
>> u'\u00A5 testing'[2:9]
u'testing'
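The only difference I can see is that the emoji lies outside the Basic Multilingual Plane while the yen sign does not, so the two encode to different numbers of UTF-16 code units (again assuming that's what ES is counting):

>> len(u'\U0001f64f')                           # one code point in Python
1
>> len(u'\U0001f64f'.encode('utf-16-le')) // 2  # but two UTF-16 code units
2
>> len(u'\u00A5'.encode('utf-16-le')) // 2      # the yen sign is just one
1

That would explain why the emoji shifts the offsets by one and the yen sign doesn't.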
Can anybody explain what is going on?
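In the meantime, here's the workaround I'm experimenting with. It assumes the offsets really are UTF-16 code-unit indices, and the helper name is my own, not anything from the ES client:

def utf16_offset_to_index(s, offset):
    """Map a UTF-16 code-unit offset onto a Python code-point index.

    Code points above U+FFFF take two UTF-16 code units (a surrogate
    pair); everything else takes one.
    """
    units = 0
    for i, ch in enumerate(s):
        if units >= offset:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(s)

>> s = u'\U0001f64f testing'
>> s[utf16_offset_to_index(s, 3):utf16_offset_to_index(s, 10)]
u'testing'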