Attempting to discover unicode code point for a character

Question

I have a character that when viewed in a hex editor is shown as:

FF  FE  08  27

meaning its binary representation is (a four-byte encoding):

11111111  11111110  00001000  00100111

Looking at the Unicode table and description, this doesn't seem to make sense, since a four-byte encoding must have a leading byte of the form 11110xxx.

I'm most likely misunderstanding the Unicode rules, but could you please let me know where I'm going wrong in determining the code point for this character?

This is not [valid UTF-8](https://en.wikipedia.org/wiki/Utf-8#Codepage_layout). It looks more like UTF-16. — Karol S, Jul 10 '14 at 20:42
I rolled back your edit to my answer. `U+FEFF` is correct and is the byte-order mark. That your raw data has it reversed is what indicates little-endian. The least significant byte is first. See http://en.wikipedia.org/wiki/Byte_order_mark. — Mark Tolonen, Jul 11 '14 at 01:45

Mark Tolonen · Accepted Answer · 2014-07-11T01:52:53.497

The leading bytes `FF FE` are the Unicode character U+FEFF, the byte order mark (BOM), stored least significant byte first. That indicates the data is little-endian UTF-16, not UTF-8. The remaining bytes `08 27` decode to U+2708, the AIRPLANE (✈️) character.

A little proof using Python 3:

>>> import unicodedata as ud
>>> s=b'\xff\xfe\x08\x27'
>>> s.decode('utf16')      # Removes BOM and uses indicated little-endian decode.
'\u2708'
>>> u = s.decode('utf-16le')   # explicit decode in little endian leaves BOM.
>>> u
'\ufeff\u2708'
>>> for c in u: print(ud.name(c))
...
ZERO WIDTH NO-BREAK SPACE  # also known as BOM.
AIRPLANE
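
To see the byte swap by hand, here is a minimal sketch that checks for the little-endian BOM and then reads the next two bytes as a single 16-bit value. The variable names (`raw`, `code_point`, `utf8`) are only for illustration, and the manual step assumes a character in the Basic Multilingual Plane (no surrogate pairs):

```python
import codecs
import unicodedata

raw = bytes([0xFF, 0xFE, 0x08, 0x27])  # the bytes shown in the hex editor

# The first two bytes match the UTF-16 little-endian BOM (FF FE).
assert raw[:2] == codecs.BOM_UTF16_LE

# Little-endian: the least significant byte comes first, so 08 27 -> 0x2708.
# Assumption: a single 16-bit code unit, i.e. no surrogate pair.
code_point = int.from_bytes(raw[2:4], byteorder="little")
print(f"U+{code_point:04X}", unicodedata.name(chr(code_point)))

# For comparison, the same character encoded as UTF-8 takes three bytes,
# with a lead byte of the form 1110xxxx.
utf8 = chr(code_point).encode("utf-8")
print(" ".join(f"{b:02X}" for b in utf8))
```

The UTF-8 form starts with `E2`, which fits the three-byte lead pattern from the UTF-8 table, whereas `FF` never appears in valid UTF-8.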
