How does UTF-16 encode a string?

Question

b'\x14\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'

I understand that UTF-16 uses 16 bits, but what confuses me most is this: 16 bits should be just two characters, so why do I see such a long line of hexadecimal characters? I would expect something like "ee", which is 16 bits at 8 bits per character.

Can someone explain to me why I see a long line of hexadecimals?

How does UTF-16 convert strings? What is the theory behind it?

score 1 · Answer 1 · answered Nov 06 '22 at 11:04

Because of the notation, I guess you're using Python. In Python, the b'...' notation is used for bytes objects.

When a bytes object is shown in source code or on the terminal, every byte that corresponds to a printable ASCII character (values 32 to 126) is shown as that character. All other bytes are escaped using the \xNN notation, where NN is the two-digit hexadecimal value of the byte (a few common ones, such as newline and tab, get the short escapes \n and \t instead). This is why you see a strange mix of printable characters and escape codes: the output is one long sequence of bytes, not a sequence of hexadecimal digits.

Note that you can escape printable characters as well: b'\x41' is the same as b'A', since the hexadecimal number 41 (65 in decimal) is the letter A in ASCII. However, the Python interpreter doesn't do this by default.
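As a small illustration of this display rule, the sketch below builds a bytes object from raw numbers and looks at how Python shows it. The variable names (raw, shown, same) are just made up for this example.

```python
# Five bytes: 'h', a zero byte, 0x14, 0xFE and 'A'.
raw = bytes([0x68, 0x00, 0x14, 0xFE, 0x41])

# repr() is what the interpreter prints: printable ASCII bytes appear
# as characters, all other bytes appear as \xNN escapes.
shown = repr(raw)
print(shown)        # b'h\x00\x14\xfeA'

# Escaping a printable character yourself gives the very same byte.
same = b'\x41' == b'A'
print(same)         # True

# The underlying data is just a list of numbers from 0 to 255.
print(list(raw))    # [104, 0, 20, 254, 65]
```

So the h in your output is not a hexadecimal digit; it is simply the byte 0x68 shown as a letter.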

How does UTF-16 work?

UTF-16 simply uses 16 bits (= 2 bytes) for every character ¹. There are however two ways of ordering those two bytes, called little endian (least significant byte first) and big endian (most significant byte first). To decode UTF-16 data, you have to know which byte order was used. Sometimes UTF-16 data starts with a Byte Order Mark (BOM), the special character U+FEFF, which can be used to determine the byte ordering.
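To make the two byte orders concrete, here is a minimal sketch using Python's built-in codecs. The sample string "hi" and the variable names are my own choice for illustration.

```python
import codecs

text = "hi"

# Explicit byte order, no BOM is written.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")
print(little)   # b'h\x00i\x00'
print(big)      # b'\x00h\x00i'

# Plain "utf-16" writes a BOM first, then the data in the machine's
# native byte order (so the exact bytes depend on the platform).
with_bom = text.encode("utf-16")
bom = with_bom[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))   # True

# When decoding with "utf-16", the BOM is read and then dropped.
print(with_bom.decode("utf-16"))   # hi
```

Each character takes two bytes in both variants; only the order of the two bytes differs.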

Your Python bytes object b'\x14\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00' consists of 24 bytes, so 12 UTF-16 characters: one for the first two bytes plus the 11 characters of "hello world". I guess your first byte is corrupted somehow, because the first two bytes \x14\xfe result in a strange character (U+FE14 when read as little endian). It probably should have been \xff instead of \x14, because when the data starts with the two bytes \xff\xfe, this is a signal that the bytes are stored in little-endian format. (See this table on Wikipedia).
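If you want to check this yourself, the following sketch splits your data into 2-byte units by hand. It assumes little-endian order (which the h\x00 pattern suggests); the variable names are only for illustration.

```python
data = b'\x14\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'

n_bytes = len(data)
print(n_bytes)   # 24

# Combine every pair of bytes into one 16-bit number, low byte first.
units = [int.from_bytes(data[i:i + 2], "little") for i in range(0, n_bytes, 2)]
print(len(units))       # 12
print(hex(units[0]))    # 0xfe14 (a real BOM would give 0xfeff)
print(hex(units[1]))    # 0x68, the code point of 'h'

# Decoding with an explicit byte order keeps the odd first character;
# everything after it is the expected text.
decoded = data.decode("utf-16-le")
rest = decoded[1:]
print(rest)             # hello world
```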

Finally, decoding the data in Python is very simple:

b'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'.decode('utf-16')

output:

'hello world'

¹ This is not entirely true, because characters above U+FFFF (many emoji, for example) are actually represented using a combination of two 16-bit units, called a surrogate pair, so they take 4 bytes. You should probably ignore that for now. For (much) more information about UTF-16, see Wikipedia.
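For the curious, a quick sketch of the exception from the footnote. The emoji U+1F600 is just an example I picked; any character above U+FFFF behaves the same way.

```python
emoji = "\U0001F600"   # one character, code point above U+FFFF

encoded = emoji.encode("utf-16-le")
print(len(emoji))      # 1
print(len(encoded))    # 4 (two 16-bit units instead of one)

# Split the 4 bytes into the two 16-bit units of the surrogate pair.
units = [int.from_bytes(encoded[i:i + 2], "little") for i in range(0, 4, 2)]
print([hex(u) for u in units])   # ['0xd83d', '0xde00']

# Decoding puts the pair back together into a single character.
print(encoded.decode("utf-16-le") == emoji)   # True
```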
