-1

I have a character that when viewed in a hex editor is shown as:

FF  FE  08  27

meaning its binary representation is (a four-byte encoding):

11111111
11111110
00001000
00100111

Looking at the Unicode table and description, this doesn't seem to make sense, since a four-byte encoding must have a leading byte of the form 11110xxx.

I'm most likely misunderstanding the Unicode rules, but could you please let me know where I'm going wrong in determining the code point for this character?

m.edmondson
  • This is not [valid UTF-8](https://en.wikipedia.org/wiki/Utf-8#Codepage_layout). It looks more like UTF-16. – Karol S Jul 10 '14 at 20:42
  • I'm a bit miffed as to the negative vote. – m.edmondson Jul 10 '14 at 22:03
  • I rolled back your edit to my answer. `U+FEFF` is correct and is the byte-order mark. That your raw data has it reversed is what indicates little-endian. The least significant byte is first. See http://en.wikipedia.org/wiki/Byte_order_mark. – Mark Tolonen Jul 11 '14 at 01:45

1 Answer

4

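The first two bytes, FF FE, are the Unicode character U+FEFF, the byte order mark (BOM), stored least significant byte first. That indicates little-endian UTF-16 encoding, not UTF-8. The remaining bytes, 08 27, are then the single 16-bit code unit 0x2708, so the character is U+2708, or the AIRPLANE (✈️) character.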

A little proof using Python 3:

>>> import unicodedata as ud
>>> s=b'\xff\xfe\x08\x27'
>>> s.decode('utf16')      # Removes BOM and uses indicated little-endian decode.
'\u2708'
>>> u = s.decode('utf-16le')   # Explicit little-endian decode leaves the BOM in place.
>>> u
'\ufeff\u2708'
>>> for c in u: print(ud.name(c))   # ZERO WIDTH NO-BREAK SPACE is the formal name of the BOM.
...
ZERO WIDTH NO-BREAK SPACE
AIRPLANE
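
To make the byte-order reasoning explicit, here is a minimal sketch that does the same decoding by hand. The helper name `decode_utf16_with_bom` is hypothetical (it is not a standard library function), and the sketch assumes the data starts with a BOM and contains only characters in the Basic Multilingual Plane, so surrogate pairs are not handled.

```python
import codecs

def decode_utf16_with_bom(data: bytes) -> list:
    """Return the code points in BOM-prefixed UTF-16 data.

    Hypothetical helper for illustration: assumes a leading BOM and
    BMP-only text (no surrogate pairs).
    """
    if data.startswith(codecs.BOM_UTF16_LE):    # b'\xff\xfe'
        byteorder = "little"
    elif data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        byteorder = "big"
    else:
        raise ValueError("no UTF-16 byte order mark found")
    body = data[2:]  # skip the two BOM bytes
    # Each remaining pair of bytes is one 16-bit code unit.
    return [int.from_bytes(body[i:i + 2], byteorder)
            for i in range(0, len(body), 2)]

raw = b"\xff\xfe\x08\x27"
code_points = decode_utf16_with_bom(raw)
print([f"U+{cp:04X}" for cp in code_points])

# The same bytes are not valid UTF-8: 0xFF never appears in UTF-8 data.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)
```

In real code the built-in 'utf-16' codec shown above is the better choice, since it also handles surrogate pairs; the manual version is only meant to show where 0x2708 comes from.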
Mark Tolonen