Can one guess character encoding looking at binary/hex data?

Question

I have a file that begins as follows (hex from `od -x <filename>`):

8fae 3800 7c00 2200 4300 6800 6100 7200

The corresponding characters are:

®8 | " C h a r

It was expected to be 8|"Char, starting with number 8 and a pipe character and so on.

Are the first two bytes, 8fae, some kind of header or BOM?
Can I assume the encoding is UTF-16?

Looks like `UTF-16` to me. And that first character could always be 辮. — The name's Bob. MS Bob., Apr 16 '15 at 23:32
I think you mean, "how do you _guess_?" If you don't know, you don't know. CP437 can decode any sequence of any byte values (unlike any Unicode encoding, Windows-1252, Windows-1251 to name a few). — Tom Blodget, Apr 17 '15 at 01:09
`8f ae` is not `®` in UTF-16, it is `꺏` instead. You will have to ask the guy who wrote the file what `8f ae` actually represents. Most likely a binary header of some kind, but definitely not a BOM. The rest of the data shown is indeed UCS-2/UTF-16, though. — Remy Lebeau, Apr 17 '15 at 02:37

score 1 · Answer 1 · answered Apr 16 '15 at 23:44

The first characters may be a BOM, though they don't look like a familiar one. UTF-8 uses 0xEF,0xBB,0xBF, while UTF-16 uses U+FEFF, stored as 0xFE,0xFF (big-endian) or 0xFF,0xFE (little-endian).

Keep in mind that the BOM is optional for UTF-8 (i.e. there's UTF-8 with BOM, and there's UTF-8 without BOM). So unfortunately, when there's no BOM, it's difficult to reliably identify a file's encoding. Some libraries or plugins use character dictionaries to guess encodings.
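
As a rough illustration of the BOM check, here is a minimal Python sketch. The names `BOMS` and `sniff_bom` are made up for this example, and the sample bytes assume the hex in the question is read left to right in file order, which may not hold because `od -x` prints 16-bit words in host byte order.

```python
import codecs

# Known BOM signatures. UTF-32 is listed first because the UTF-32 LE BOM
# (FF FE 00 00) starts with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Bytes from the question, taken in the order printed (an assumption).
sample = bytes.fromhex("8fae38007c0022004300680061007200")

# No known BOM matches the leading 8f ae.
encoding, bom_len = sniff_bom(sample)

# Skipping the two unknown leading bytes and decoding the rest as
# little-endian UTF-16 yields the text the asker expected.
text = sample[2:].decode("utf-16-le")
```

Under that byte-order assumption, the first two bytes match no BOM, and the remainder decodes cleanly as UTF-16 LE. That supports the "unknown header followed by UTF-16" reading, but it is still a guess rather than proof of the encoding.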
