Any solution to this is going to be heuristic-based. But in general, UTF-8 has the following byte sequences (available in man utf8):
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
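To make the table concrete, here is a minimal sketch that fills in the x bits for a given code point. The function name utf8_encode is hypothetical, and the sketch follows only the four ranges listed above; it makes no attempt to reject surrogates or values above 0x10FFFF, which a strict encoder would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 bytes for cp into out (room for
 * at least 4 bytes) and returns how many bytes were written, or 0 if cp
 * is above the last range in the table. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x1FFFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, code point 0xE9 falls in the second range and comes out as the two bytes 0xC3 0xA9, while 0x20AC falls in the third range and comes out as 0xE2 0x82 0xAC.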
So your heuristic can look a few bytes ahead and see whether the bytes follow one of four patterns (UTF-8 in theory supports byte sequences stretching to six bytes, but in practice only uses four):
- 0* (you'll have to be careful to distinguish this from regular ASCII files)
- 110*, 10*
- 1110*, 10*, 10*
- 11110*, 10*, 10*, 10*
Checking for these is easy. To check whether an unsigned char a fits one of these patterns, run:
- For 10* (the most frequent pattern), use (a >> 6) == 0x2.
- For 0*, use (a >> 7) == 0x0.
- For 110*, use (a >> 5) == 0x6.
- For 1110*, use (a >> 4) == 0xe.
- For 11110*, use (a >> 3) == 0x1e.
All we're doing is shifting the bits to the right and checking if they're equal to the bits in the UTF-8 byte sequences.
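Putting the checks together, a rough sketch of the look-ahead heuristic could look like the following. The names utf8_seq_len, utf8_is_continuation and looks_like_utf8 are made up for illustration, and treating a sequence cut off at the end of the buffer as a mismatch is an assumption; you may prefer to tolerate it if you only sample the start of a file. This checks the bit patterns only, so it does not catch overlong encodings or other sequences that a strict validator would reject.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the sequence that lead byte a announces, or 0 if a cannot
 * start a sequence (it is a 10* continuation byte or a 11111* byte). */
static size_t utf8_seq_len(unsigned char a)
{
    if ((a >> 7) == 0x0)  return 1;   /* 0*     */
    if ((a >> 5) == 0x6)  return 2;   /* 110*   */
    if ((a >> 4) == 0xe)  return 3;   /* 1110*  */
    if ((a >> 3) == 0x1e) return 4;   /* 11110* */
    return 0;
}

/* Nonzero if a matches the 10* continuation pattern. */
static int utf8_is_continuation(unsigned char a)
{
    return (a >> 6) == 0x2;
}

/* Returns 1 if every byte in buf fits one of the four patterns, else 0.
 * If multibyte is not NULL, it receives the number of multi-byte
 * sequences seen; 0 there means the data is also plain ASCII. */
int looks_like_utf8(const unsigned char *buf, size_t len, size_t *multibyte)
{
    size_t i = 0, count = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_seq_len(buf[i]);
        if (n == 0 || n > len - i)
            return 0;                 /* bad lead byte or truncated tail */
        for (size_t k = 1; k < n; k++) {
            if (!utf8_is_continuation(buf[i + k]))
                return 0;             /* expected a 10* byte here */
        }
        if (n > 1)
            count++;
        i += n;
    }
    if (multibyte != NULL)
        *multibyte = count;
    return 1;
}
```

A return value of 1 with a multi-byte count of 0 is the plain ASCII case mentioned earlier: the bytes are valid UTF-8, but nothing distinguishes them from any other ASCII-compatible encoding.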