You simply can't detect which UTF encoding is used by your data just by looking at it: UTF-8, UTF-16 and UTF-32 are all just different encodings of the same Unicode characters.
Fortunately, data can be tagged with its encoding, although doing so is not mandatory. The BOM was introduced almost precisely for that purpose (see the note below), but you'll find plenty of UTF documents, in every encoding, that don't carry one. The same bit pattern means different things in UTF-8, UTF-16 and UTF-32, so otherwise you'll have to try each encoding, discard the ones that produce decoding errors (possibly all of them), and guess among the rest.
If the document has a BOM at the beginning, that sequence of bytes will let you detect which encoding is being used, as the BOM's representation is a different byte pattern in each encoding:
0xef, 0xbb, 0xbf => UTF-8 (no endianness)
0xfe, 0xff => UTF-16-BE (big endian)
0xff, 0xfe => UTF-16-LE (little endian)
0x00, 0x00, 0xfe, 0xff => UTF-32-BE (big endian)
0xff, 0xfe, 0x00, 0x00 => UTF-32-LE (little endian)
But as you can see, the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM, so this doesn't completely answer your question. For example, a file containing only the sequence 0xff, 0xfe, 0x00, 0x00 is a perfectly valid UTF-32-LE file with no data (just the BOM), or an equally valid UTF-16-LE file containing a single NULL character, U+0000.
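For illustration, here is a minimal Python sketch of that BOM check (the function name sniff_bom is my own, not anything standard). Note that it has to test the four-byte UTF-32 patterns before the two-byte UTF-16 ones, and even then it cannot resolve the UTF-32-LE / UTF-16-LE ambiguity by itself:

    import codecs

    def sniff_bom(data: bytes):
        """Best-guess encoding from a leading BOM, or None if there is no BOM."""
        if data.startswith(codecs.BOM_UTF8):        # ef bb bf
            return "utf-8-sig"
        if data.startswith(codecs.BOM_UTF32_BE):    # 00 00 fe ff
            return "utf-32-be"
        if data.startswith(codecs.BOM_UTF32_LE):    # ff fe 00 00 -- also a valid UTF-16-LE prefix!
            return "utf-32-le"
        if data.startswith(codecs.BOM_UTF16_BE):    # fe ff
            return "utf-16-be"
        if data.startswith(codecs.BOM_UTF16_LE):    # ff fe
            return "utf-16-le"
        return None

    print(sniff_bom(b"\xef\xbb\xbfhello"))   # utf-8-sig
    print(sniff_bom(b"\xff\xfe\x00\x00"))    # utf-32-le (or an empty UTF-16-LE doc plus U+0000)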
The best approach is to pass the encoding as a parameter to your input routines, so they can decode the data appropriately.
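In Python, for example, that simply means handing the codec name to whatever reads the data (the file name here is just a placeholder):

    # The caller states the encoding explicitly instead of guessing it.
    with open("data.txt", encoding="utf-16-le") as f:
        text = f.read()

    # The same idea for bytes already in memory:
    text = b"\xa2\x00".decode("utf-16-le")   # -> U+00A2, CENT SIGN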
Edit
In the example you use, the character is CENT SIGN; I'll assume you mean the code point U+00A2, which in binary, padded to the 21 bits a Unicode code point can require, is 0 0000 0000 0000 1010 0010. If you encode this character as UTF-8 you get the two-byte sequence 0xc2, 0xa2; as UTF-16-LE you get 0xa2, 0x00; as UTF-16-BE you get 0x00, 0xa2; as UTF-32-LE you get 0xa2, 0x00, 0x00, 0x00; and as UTF-32-BE you get 0x00, 0x00, 0x00, 0xa2.
The problem here is that UTF-32 represents every code point as a 4-byte sequence (in one byte order or the other, depending on endianness), UTF-16 represents them as sequences of 2-byte units (almost all code points fall below U+10000, so almost all fit in a single 16-bit unit; the rest need a surrogate pair), and UTF-8 uses sequences of 1 to 4 bytes. So the first thing you have to understand is that a Unicode CODE POINT (the character's numerical position in the whole Unicode table) is different from the byte sequence used to represent it, which already has the encoding baked in. You cannot tell which encoding was used to encode a Unicode character by running tests against the code point itself.
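You can reproduce all of those byte sequences directly, which also makes the point of the last paragraph visible: the code point (0xa2) never changes, only its encoded form does. A quick Python check:

    cent = "\u00a2"            # CENT SIGN
    print(hex(ord(cent)))      # 0xa2 -- the code point, independent of any encoding

    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(enc, cent.encode(enc).hex(" "))

    # utf-8     c2 a2
    # utf-16-le a2 00
    # utf-16-be 00 a2
    # utf-32-le a2 00 00 00
    # utf-32-be 00 00 00 a2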
Note
The BOM is an alternate use of the ZERO WIDTH NO-BREAK SPACE character, U+FEFF. When it appears at the beginning of a document it is interpreted as the Byte Order Mark instead, which is unfortunate, as it forces you to include the character twice if you actually want a document to begin with it. Its byte-swapped counterpart, U+FFFE, is by definition not a character at all, so you will seldom see it in a normal document; that is deliberate, since a decoder that reads U+FFFE knows it has picked the wrong byte order. (The character substituted for undecodable input is yet another one, the REPLACEMENT CHARACTER U+FFFD.)
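A quick sketch of that byte-order trick in Python (which decodes the noncharacter without complaint):

    bom = "\ufeff".encode("utf-16-le")    # b'\xff\xfe' -- the BOM as written by a little-endian encoder
    print(repr(bom.decode("utf-16-le")))  # '\ufeff' -> right byte order: it's the BOM
    print(repr(bom.decode("utf-16-be")))  # '\ufffe' -> wrong byte order: the noncharacter U+FFFE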
CREDITS
The BOM representation table has been drawn from the Wikipedia page on the byte order mark.