
I'm getting some data from the Telegram Bot API, using the python-telegram-bot library. The data is returned as UTF-8 encoded JSON. Example (snippet):

{'message': {'text': '\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}

It can be seen that 'entities' contains a single entity of type url, and that it has a length and an offset. Now say I want to extract the URL of the link from the 'text' attribute:

data = {'message': {'text': '\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}
text = data['message']['text']
entities = data['message']['entities']
for entity in entities:
    start = entity['offset']
    end = start + entity['length']
    print('Url: ', text[start:end])

The code above, however, prints '://google.com/æøå', which is clearly not the actual URL.
The reason for this is that the offset and length are counted in UTF-16 code units. So my question is: is there any way to work with UTF-16 code units in Python? I don't need more than to be able to count them.
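
To show the mismatch on a single character (a quick check, assuming Python 3, where len() counts code points):

emoji = '\U0001f468'  # one of the emoji in the message text
print(len(emoji))                           # 1 - Python counts code points
print(len(emoji.encode('utf-16-le')) // 2)  # 2 - Telegram counts UTF-16 code units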

I've already tried:

text.encode('utf-8').decode('utf-16')

But that gives the error: UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xa5 in position 48: truncated data

Any help would be greatly appreciated. I'm using Python 3.5, but since it's for a unified library it would be lovely to get it to work in Python 2.x too.

jsmnbom
  • Welcome to the site! Check out the [tour](https://stackoverflow.com/tour) for more on asking questions that will attract quality answers. Would you please [edit your question](https://stackoverflow.com/posts/39280183/edit) to include the following? (1) a link to the specific API you are using; (2) the code you are using to receive the response text; and (3) the code you are using to obtain the `entities` list? That information would be very helpful to me in answering. Thank you! – cxw Sep 01 '16 at 20:25
  • @cxw I've edited my question to reflect the changes as best I can. – jsmnbom Sep 01 '16 at 20:40

1 Answer


Python has already correctly decoded the UTF-8 encoded JSON data to Python (Unicode) strings, so there is no need to handle UTF-8 here.

You'd have to encode to UTF-16, take the length of the encoded data, and divide by two. I'd encode to either utf-16-le or utf-16-be to prevent a BOM from being added:

>>> len(text.encode('utf-16-le')) // 2
32

To use the entity offsets, you can encode to UTF-16, slice on doubled offsets, then decode again:

text_utf16 = text.encode('utf-16-le')  # 2 bytes per UTF-16 code unit
for entity in entities:
    start = entity['offset']
    end = start + entity['length']
    # the offsets count UTF-16 code units, so double them to index the bytes
    entity_text = text_utf16[start * 2:end * 2].decode('utf-16-le')
    print('Url: ', entity_text)
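
If you'd rather slice the original string directly, the same counting idea can be wrapped in a small helper that converts a UTF-16 code-unit offset into a str index by walking the text (a sketch assuming Python 3.3+, where indexing works on code points; the helper name is made up for illustration):

def utf16_offset_to_index(text, utf16_offset):
    # Characters above U+FFFF occupy a surrogate pair (2 code units) in UTF-16,
    # everything else occupies 1.
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        units += 2 if ord(char) > 0xffff else 1
    return len(text)

for entity in entities:
    start = utf16_offset_to_index(text, entity['offset'])
    end = utf16_offset_to_index(text, entity['offset'] + entity['length'])
    print('Url: ', text[start:end])  # -> http://google.com/æøå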
Martijn Pieters
  • I get a `UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x6f in position 30: truncated data` when I try to execute that code... Could that be related to this from the python reference? "Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded. The utf-32* decoders no longer decode byte sequences that correspond to surrogate code points." – jsmnbom Sep 01 '16 at 20:42
  • @bomjacob: no, this is not related to that. You managed to slice a UTF-16 byte pair in half, check that you correctly doubled the offsets. The code I posted in my answer correctly prints `http://google.com/æøå`. – Martijn Pieters Sep 01 '16 at 20:43
  • @bomjacob: if you had a narrow Python build (below 3.3) you'd not have problems with the slicing in the first place, as the Python string would also treat the non-BMP characters as having a width of 2. – Martijn Pieters Sep 01 '16 at 20:44
  • @Martijn Oh yeah, sorry, I derped. This seems to work flawlessly, even on the weirdest surrogate characters that I can find :) Thank you so much for the help :D – jsmnbom Sep 01 '16 at 20:50
  • @Martijn, alright, that's good to know, since I probably have to port this code to Python 2 too. Is there any way to check if it's a narrow or wide build? – jsmnbom Sep 01 '16 at 20:51
  • @bomjacob: check the value of [`sys.maxunicode`](https://docs.python.org/2/library/sys.html#sys.maxunicode); if it is set to 65535 (== `0xffff`) you have a narrow build, a wide build otherwise (and the value will be 1114111 == `0x10ffff`). – Martijn Pieters Sep 01 '16 at 20:52
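
For reference, that check boils down to something like this (a minimal sketch; it only matters on Python 2 or a CPython older than 3.3):

import sys

# Narrow builds store strings as UTF-16 internally, so non-BMP characters already
# count as two when indexing and slicing; wide builds use one unit per code point.
if sys.maxunicode == 0xffff:  # 65535 -> narrow build
    print('narrow build')
else:  # 0x10ffff == 1114111 -> wide build
    print('wide build')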