UTF-16 Code Units In Python Polyglot

Question

I need to find the offset, measured in UTF-16 code units from the start of a Python string, at which a location name begins. I am using Polyglot NER to tag locations in a Python string. For example, "Obama was born in the United States. But I was born in Alabama" would mark "United States" and "Alabama". The Polyglot extractor simply returns the tagged locations and how many words from the front each one starts. How do I figure out how many UTF-16 code units from the start of the string each location occurs?

Java interface that requires the information: https://github.com/Berico-Technologies/CLAVIN/blob/master/src/main/java/com/bericotech/clavin/extractor/LocationOccurrence.java

You can't decode it first so that you're working with text instead? — Ignacio Vazquez-Abrams, Sep 22 '16 at 23:21
I am working with text. I honestly can't figure out how it is using code units as a distance, or how to get that distance — Sepehr Sobhani, Sep 22 '16 at 23:23
If you need to care about the encoding then you're working with bytes, not text. — Ignacio Vazquez-Abrams, Sep 22 '16 at 23:25

monkut · Answer 1 · 2016-09-23T02:36:07.000

Just to clarify some of @Ignacio Vazquez-Abrams' comments. When processing or analyzing text you don't want to have to worry about how many bytes a given character takes up. That's why you take the 'encoding' out of the equation by first 'decoding' the encoded text to a separate text/str representation.

>>> encoded_text = 'hello world'.encode('utf16')
>>> encoded_text
b'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>>> type(encoded_text)
<class 'bytes'>
>>> len(encoded_text)
24


>>> decoded_text = encoded_text.decode('utf16')
>>> decoded_text
'hello world'
>>> type(decoded_text)
<class 'str'>
>>>
>>> len(decoded_text)
11

I did see the UTF-16 code units in the java code you posted...

You could do something like this to get the number of bytes from the start:

sentence = "Obama was born in the United States. But I was born in Alabama".encode('UTF-16LE')
word = 'United States'.encode('UTF-16LE')

bytes_from_start = None
# Step by 2: every character occupies a multiple of 2 bytes in UTF-16LE,
# so a genuine match can only begin at an even byte offset.
for start_byte_position in range(0, len(sentence), 2):
    candidate = sentence[start_byte_position: start_byte_position + len(word)]
    if word == candidate:
        bytes_from_start = start_byte_position
        print('bytes from start: ', bytes_from_start)
        print('Preceding text: "{}"'.format(sentence[:start_byte_position].decode('UTF-16LE')))
        break

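Note that a UTF-16 code unit is 16 bits, i.e. two bytes, so the byte count above is twice the number of code units. I have a feeling the interface really just wants the number of characters from the start: a Java String index counts UTF-16 code units, and that matches the Python character index as long as the text contains no characters outside the Basic Multilingual Plane (emoji, for example, take two code units each). If that's all you need, you can use the str object's .index() method: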

sentence = "Obama was born in the United States. But I was born in Alabama"
word = 'United States'
characters_from_start = sentence.index(word)
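
If the text might contain characters outside the Basic Multilingual Plane, the character index and the UTF-16 code-unit offset can differ. Here is a minimal sketch of one way to convert between them. The helper names utf16_offset and token_char_offset are made up for illustration, and the token list is written out by hand; I am assuming the tagger gives you the words as plain strings in the order they appear, which you should check against what Polyglot actually returns.

```python
def utf16_offset(text, char_index):
    """Number of UTF-16 code units that precede text[char_index]."""
    # The 'utf-16-le' codec emits no BOM, and every code unit is 2 bytes,
    # so the byte length of the prefix divided by 2 is the code-unit count.
    return len(text[:char_index].encode('utf-16-le')) // 2


def token_char_offset(text, tokens, token_index):
    """Character index at which tokens[token_index] starts in text.

    Hypothetical helper: assumes every token appears verbatim in text,
    in order.
    """
    position = 0
    for i, token in enumerate(tokens):
        # Search from the end of the previous token so repeated words
        # ("was", "born", "in") resolve to the correct occurrence.
        position = text.index(token, position)
        if i == token_index:
            return position
        position += len(token)
    raise IndexError('token_index out of range')


sentence = "Obama was born in the United States. But I was born in Alabama"
# Hand-written token list standing in for the tagger's word list (assumption).
tokens = ['Obama', 'was', 'born', 'in', 'the', 'United', 'States', '.',
          'But', 'I', 'was', 'born', 'in', 'Alabama']

start = token_char_offset(sentence, tokens, 5)  # word 5 is 'United'
print(start, utf16_offset(sentence, start))     # 22 22

# An emoji is one Python character but two UTF-16 code units.
emoji_sentence = "\U0001F600 born in Alabama"
idx = emoji_sentence.index('Alabama')
print(idx, utf16_offset(emoji_sentence, idx))   # 10 11
```

For plain ASCII text like the example sentence the two numbers are identical, which is why .index() alone is often enough; the conversion only matters once surrogate pairs show up.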

The interface requires UTF-16 code units, so the encoding, and how many bytes a character takes up, is important. — Sepehr Sobhani, Sep 23 '16 at 01:15
