I have a file that begins as follows (hex output from `od -x <filename>`):

8fae 3800 7c00 2200 4300 6800 6100 7200

The corresponding characters are:

®8 | " C h a r

I expected it to be `8|"Char`, starting with the digit 8, then a pipe character, and so on.

  1. Are the first two bytes, 8fae, some kind of header or BOM?
  2. Can I assume the encoding is UTF-16?
dbza
  • Looks like `UTF-16` to me. And that first character could always be 辮. – The name's Bob. MS Bob. Apr 16 '15 at 23:32
  • I think you mean, "how do you _guess_?" If you don't know, you don't know. CP437 can decode any sequence of any byte values (unlike any Unicode encoding, Windows-1252, Windows-1251 to name a few). – Tom Blodget Apr 17 '15 at 01:09
  • `8f ae` is not `®` in UTF-16, it is `꺏` instead. You will have to ask the guy who wrote the file what `8f ae` actually represents. Most likely a binary header of some kind, but definitely not a BOM. The rest of the data shown is indeed UCS-2/UTF-16, though. – Remy Lebeau Apr 17 '15 at 02:37

1 Answer

The first two bytes might be meant as a BOM, but they do not match any standard one. UTF-8 uses 0xEF,0xBB,0xBF, while UTF-16 encodes U+FEFF as 0xFE,0xFF (big-endian) or 0xFF,0xFE (little-endian).
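As a rough sketch of that check, the Python snippet below compares the start of a byte string against the standard BOM constants from the `codecs` module. The helper name `detect_bom` and the sample bytes are illustrative assumptions. Since `od -x` prints 16-bit words in host byte order, the file's first two bytes could be either `8f ae` or `ae 8f`, so both orders are tried.

```python
import codecs

# Standard BOMs. UTF-32 LE (FF FE 00 00) is listed before UTF-16 LE (FF FE)
# so the longer match is tried first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


# First bytes of the file in either possible byte order (hypothetical samples).
print(detect_bom(bytes.fromhex("8fae 3800 7c00")))  # None
print(detect_bom(bytes.fromhex("ae8f 0038 007c")))  # None
# For comparison, a genuine UTF-16 LE BOM followed by the character '8'.
print(detect_bom(b"\xff\xfe8\x00"))                 # ('utf-16-le', 2)
```

Neither order matches a known BOM, which fits the view in the comments that the two bytes are some other kind of header.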

Keep in mind that a BOM is optional, for UTF-8 in particular (i.e. there's UTF-8 with BOM, and there's UTF-8 without BOM). So unfortunately, when there's no BOM, it's difficult to identify a file's encoding reliably. Some libraries and plugins guess the encoding heuristically, for example from character dictionaries.
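To illustrate what such a guess can look like, here is a minimal heuristic in Python. It is not what any particular library does. It assumes the text is mostly ASCII, in which case UTF-16 puts a 0x00 in every high byte: at odd offsets for little-endian, at even offsets for big-endian. The function name `guess_utf16_order` and the 0.6 threshold are arbitrary choices for this sketch.

```python
def guess_utf16_order(data: bytes, threshold: float = 0.6):
    """Guess 'utf-16-le', 'utf-16-be', or None from where NUL bytes fall.

    Heuristic only: it works for mostly-ASCII text and can easily be
    fooled by binary data or by non-Latin scripts.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None
    even_nuls = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    odd_nuls = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)
    if odd_nuls / pairs >= threshold and even_nuls == 0:
        return "utf-16-le"
    if even_nuls / pairs >= threshold and odd_nuls == 0:
        return "utf-16-be"
    return None


# The 16 bytes from the question, taken in the order the dump prints them.
# Assumption: od -x shows host-order 16-bit words, so the real file may have
# each pair swapped, in which case the guess would be 'utf-16-be' instead.
sample = bytes.fromhex("8fae 3800 7c00 2200 4300 6800 6100 7200")

order = guess_utf16_order(sample)
print(order)                      # utf-16-le
# Skip the two unexplained leading bytes and decode the rest.
print(sample[2:].decode(order))   # 8|"Char
```

A guess like this only says the bytes are consistent with UTF-16. As one of the comments points out, it cannot prove what the author of the file intended.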

JavoSN