Using bit shifting to guess UTF-8 encoding

Question

I am writing a program like file(1) that can guess whether a text file contains ASCII characters, ISO-8859-1 characters, or UTF-8. I've already programmed it to guess ASCII and ISO; only UTF-8 remains. My problem is that I am supposed to be using bit-shifting, and while I know the very basics of bit-shifting, I am having trouble figuring out how to use it for guessing UTF-8 characters. I am of course not asking for a solution, but if someone could push me in the right direction, I would be pleased!

I am writing in C.

You are writing in C, but neither the question nor answer depend on that. It might be worth considering tagging it with algorithm instead. — David Conrad, Sep 16 '21 at 17:46
Bit-shifting would be an implementation detail, not a key characteristic of an heuristic for recognizing UTF-8. Do not focus on that. — John Bollinger, Sep 16 '21 at 18:00
Note also that it is possible for a file to be expressed in US-ASCII, ISO-8859-1, and UTF-8 *all at the same time*. Only if it contains characters outside the (7-bit) range of ASCII will you see differences between the ISO-8859 encodings and UTF-8. — John Bollinger, Sep 16 '21 at 18:04
I think this is homework, so it does not have to be efficient. But bit-shifting on UTF-8 is "interesting". So what I would do: try to decode it as UTF-8 (this may fail), then re-encode it (this should never fail), and check that you get the same result. That check may fail too: a common method of hiding special characters is to encode ASCII as 2-, 3-, or 4-byte UTF-8 sequences, which are technically not valid. — Giacomo Catenazzi, Sep 17 '21 at 07:09

score 6 · Accepted Answer · edited Sep 16 '21 at 19:56

Any solution to this is going to be heuristic-based. But in general, UTF-8 has the following byte sequences (available in man utf8):

0x00000000 - 0x0000007F:
    0xxxxxxx
0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
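
To make the table concrete, here is a minimal sketch of how a code point is packed into those byte patterns. The function name `utf8_encode` is hypothetical, and the cut-off at U+10FFFF is an assumption taken from the Unicode limit rather than from the table's last row. It does not reject surrogates (U+D800 to U+DFFF), so treat it as an illustration and not a conforming encoder.

```c
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 form of cp into out (room for
   4 bytes) and returns the number of bytes written, or 0 if cp is above
   U+10FFFF. Surrogates are NOT rejected in this sketch. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* not a valid code point */
}
```

For example, U+00E9 (é) falls in the second row and comes out as the two bytes 0xC3 0xA9, which is exactly the kind of pair a guesser has to recognize.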

So your heuristic can look a few bytes ahead and see if the bytes follow one of four patterns (the original UTF-8 design allowed sequences of up to six bytes, but code points above U+10FFFF are not valid, so at most four bytes are used):

0* (you'll have to be careful to distinguish this from regular ASCII files)
110*, 10*
1110*, 10*, 10*
11110*, 10*, 10*, 10*

Checking for these is easy:

To check if an unsigned char a fits one of these patterns, run:

For 10* - the most frequent pattern - use (a >> 6) == 0x2.
For 0* - use (a >> 7) == 0x0.
For 110* - use (a >> 5) == 0x6.
For 1110* - use (a >> 4) == 0xe.
For 11110* - use (a >> 3) == 0x1e.

All we're doing is shifting the bits to the right and checking if they're equal to the bits in the UTF-8 byte sequences.
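
Putting the shift checks together, one possible look-ahead scan might look like the sketch below. The names `utf8_seq_len` and `guess_utf8` and the three-way return value are assumptions made for illustration, not part of the answer. The sketch only checks the byte patterns: it does not reject overlong encodings or code points above U+10FFFF, and it treats a sequence cut off at the end of the buffer as a mismatch.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length announced by a lead byte, or 0 if
   the byte cannot start a sequence (a stray 10xxxxxx or a 11111xxx byte). */
static int utf8_seq_len(unsigned char a)
{
    if ((a >> 7) == 0x00) return 1;  /* 0xxxxxxx */
    if ((a >> 5) == 0x06) return 2;  /* 110xxxxx */
    if ((a >> 4) == 0x0e) return 3;  /* 1110xxxx */
    if ((a >> 3) == 0x1e) return 4;  /* 11110xxx */
    return 0;
}

/* Returns -1 if buf breaks the patterns, 0 if every byte is plain ASCII,
   and 1 if buf holds at least one well-formed multi-byte sequence. */
static int guess_utf8(const unsigned char *buf, size_t len)
{
    int saw_multibyte = 0;
    size_t i = 0;

    while (i < len) {
        int n = utf8_seq_len(buf[i]);

        /* Invalid lead byte, or the sequence runs past the buffer. */
        if (n == 0 || (size_t)n > len - i)
            return -1;

        /* Every following byte must match 10xxxxxx. */
        for (int k = 1; k < n; k++)
            if ((buf[i + k] >> 6) != 0x02)
                return -1;

        if (n > 1)
            saw_multibyte = 1;
        i += (size_t)n;
    }
    return saw_multibyte;
}
```

Returning 0 for pure ASCII keeps that case separate, since such a file is valid in all three encodings at once. An ISO-8859-1 byte such as 0xE9 followed by ordinary text fails the continuation check, which is what makes the heuristic useful.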

edited Sep 16 '21 at 19:56 by ikegami

answered Sep 16 '21 at 17:35 by Daniel Kleinstein

Although UTF8 _could_ technically support byte sequences longer than 4 bytes, it doesn't. Codepoints above U+10FFFF aren't valid. – Ted Lyngmo Sep 16 '21 at 17:37
@TedLyngmo Woops, fixed. – Daniel Kleinstein Sep 16 '21 at 17:41
Small nitpick: right shifting negative values can be problematic. For example, on a system where a `char` is a signed 2's complement number, the following code will print "nope". `char a = 0x80; if ((a >> 6) == 2) printf("yup\n"); else printf("nope\n");` – user3386109 Sep 16 '21 at 19:16
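
One way to sidestep the problem described in this comment is to convert to `unsigned char` before shifting. This is a minimal sketch; the helper name `is_continuation_byte` is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical helper: accepts a plain char (for example, one read from a
   char buffer) and converts it to unsigned char before shifting, so the
   result does not depend on whether char is signed on this platform. */
static bool is_continuation_byte(char c)
{
    unsigned char a = (unsigned char)c;  /* bytes 0x80..0xFF become 128..255 */
    return (a >> 6) == 0x2;              /* matches 10xxxxxx */
}
```

Reading the file into an `unsigned char` buffer in the first place avoids the cast entirely.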
Between UTF-8 and cp1252 (and thus iso-8859-1), the heuristics are extremely reliable. See [this answer](https://stackoverflow.com/a/28681865/589924) for the limits of handling a file that contains a mix of these. It's so reliable that you can literally fix a file that contains a mix of these two encodings. If only one encoding is used, [here are those limits](https://stackoverflow.com/a/22868803/589924). You'll probably get similar results when pitting UTF-8 against one of the other 8-bit code pages or one of the encodings in the iso-8859 family. – ikegami Sep 16 '21 at 20:00
@user3386109: And for exactly that reason, it is strongly suggested to handle UTF-8 in *unsigned* chars. Actually, C++20 made that official with `char8_t`, which is specified as unsigned. – DevSolar Sep 16 '21 at 20:00
@DevSolar I wonder if `char8_t` is planned for the next version of the C standard. That would be a nice addition. – user3386109 Sep 16 '21 at 20:04
@user3386109, I hope not. C already has `uint8_t`, and `char8_t` is a misleading name if it's always unsigned. – ikegami Sep 16 '21 at 20:04
@ikegami: `uint8_t` is an optional type for storing an unsigned integer value of exactly 8 bits. A machine with `CHAR_BIT != 8` wouldn't have it. `char8_t` (in C++20) is a mandatory type capable of storing a UTF-8 code unit. It completes the already existing `char16_t` (UTF-16) and `char32_t` (UTF-32). It's basically a variation on `uint_least8_t`, but explicitly for Unicode code units. Not misleading in the least. – DevSolar Sep 17 '21 at 08:25