
Is there a list of possible BOM characters that are used? So far I have encountered:

\x00\x00\xfe\xff    UTF-32, big-endian
\xff\xfe\x00\x00    UTF-32, little-endian
\xfe\xff            UTF-16, big-endian
\xff\xfe            UTF-16, little-endian
\xef\xbb\xbf        UTF-8

Are there any additional ones that I'm missing?

  • You have all the values for the official Unicode encodings. For any other encoding, encode the BOM character and see which "BOM bytes" you get. For example, UTF-7 existed in the past, but I don't think anybody put a BOM on such strings (and you will probably never encounter UTF-7 text anyway). So check the other way around: determine the encoding first, then derive its BOM. – Giacomo Catenazzi Jan 14 '19 at 14:50
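The check suggested in the comment above can be sketched in Python by encoding U+FEFF under each candidate encoding. The helper name `bom_bytes` is hypothetical, and the encoding names are Python codec names rather than anything given in the question.

```python
def bom_bytes(encoding: str) -> bytes:
    """Return the bytes that U+FEFF (the BOM character) produces in `encoding`."""
    return "\ufeff".encode(encoding)


# Explicit-endian codec names are used so that Python does not prepend
# a BOM of its own to the encoded character.
for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{name:10} {bom_bytes(name).hex(' ')}")
```

This should reproduce the five byte sequences listed in the question. The same call works for any other codec Python knows about.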

1 Answer


Short answer: no, you've covered them.

According to the Unicode spec, there are three encoding forms: UTF-8, UTF-16, and UTF-32. For serialization to bytes, the spec lists UTF-16, UTF-16LE, and UTF-16BE as separate encoding schemes, and likewise UTF-32, UTF-32LE, and UTF-32BE.
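As a minimal sketch of how the five signatures from the question could be checked in practice, the Python snippet below compares the start of a byte string against the BOM constants in the standard `codecs` module. The function name `detect_bom` and its return shape are assumptions made for illustration.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One caveat: UTF-16 LE text whose first character after the BOM is U+0000 is byte-for-byte indistinguishable from a UTF-32 LE BOM, so a result from a check like this is a strong hint rather than a proof.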

It's important to know that if the character stream is explicitly labeled as one of the LE or BE forms, an initial BOM byte sequence is not treated as a byte order mark; it must be interpreted as the character U+FEFF ZERO WIDTH NO-BREAK SPACE. I.e.

UTF-16BE  initial FE FF is treated as U+FEFF
UTF-16LE  initial FF FE is treated as U+FEFF
UTF-32BE  initial 00 00 FE FF is treated as U+FEFF
UTF-32LE  initial FF FE 00 00 is treated as U+FEFF
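Python's built-in codecs happen to behave this way, which makes the distinction easy to see. This is an illustration of one implementation, not something taken from the spec text; the variable names are arbitrary.

```python
data = b"\xfe\xff\x00A"  # FE FF followed by "A" in big-endian UTF-16

# Generic "utf-16": the leading FE FF is consumed as a byte order mark.
generic = data.decode("utf-16")

# Explicit "utf-16-be": the leading FE FF is kept as the character U+FEFF.
explicit = data.decode("utf-16-be")

print(repr(generic))
print(repr(explicit))
```

So if you already know the endianness and still want to drop a leading signature, you have to strip the U+FEFF yourself after decoding.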

See http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G2212 for more details.

J Quinn
  • Where do you find the second statement? D14 says that U+FEFF is reserved. IIRC, old text could contain U+FEFF as a zero width no-break space, but that concerns old Unicode versions, so it is practically nonexistent (and the encoding should be defined externally). – Giacomo Catenazzi Jan 14 '19 at 14:45