
What I am trying to do is read a UTF-8, UTF-16, or UTF-32 character into an int; in doing so, the program should be able to tell whether the input is UTF-8, UTF-16, or UTF-32.

I read binary values from a text file opened with fopen(filename, "rb"). I run into a problem where a single character is split into two bytes.

For instance, if I try to read a character CENT SIGN

The text file input.txt contains:

¢

I get:

utf code:       LATIN CAPITAL LETTER A WITH CIRCUMFLEX
binary:         11000010
hexadecimal:    0xC2
decimal:        194
character:      �

utf code:       CENT SIGN
binary:         10100010
hexadecimal:    0xA2
decimal:        162
character:      �

utf code:       LINE FEED (LF)
binary:         00001010
hexadecimal:    0xA
decimal:        10
character:

Code:

int ch;
while ((ch = fgetc(stream)) != EOF) {
    printf("utf code:\t");
    findCode(ch); // HERE

    write(1, "binary:         ", 16);
    printBits(ch);

    printf("\nhexadecimal:\t%X", ch);

    printf("\ndecimal:\t%d", ch);

    printf("\ncharacter:\t%c\n\n", ch);
}
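Since fgetc returns one byte at a time, a multi-byte UTF-8 sequence has to be assembled by hand from the lead byte's bit pattern. A minimal sketch, assuming well-formed UTF-8 input (read_utf8_codepoint is a hypothetical helper, not part of the original program):

```c
#include <stdio.h>

/* Read one UTF-8 sequence from the stream and return its code point.
   Returns -1 on EOF or malformed input. */
long read_utf8_codepoint(FILE *stream) {
    int c = fgetc(stream);
    if (c == EOF) return -1;

    int extra;   /* number of continuation bytes expected */
    long cp;     /* accumulated code point */
    if      ((c & 0x80) == 0x00) { cp = c;        extra = 0; } /* 0xxxxxxx */
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; } /* 110xxxxx */
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; } /* 1110xxxx */
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; } /* 11110xxx */
    else return -1;  /* stray continuation byte or invalid lead byte */

    while (extra-- > 0) {
        c = fgetc(stream);
        if (c == EOF || (c & 0xC0) != 0x80) return -1; /* must be 10xxxxxx */
        cp = (cp << 6) | (c & 0x3F);  /* append the 6 payload bits */
    }
    return cp;
}
```

Fed the two bytes 0xC2 0xA2 from the example file, this returns 0xA2, the code point of CENT SIGN.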

NOTE: In UTF-8, the CENT SIGN is encoded as the two bytes 0xC2 0xA2, i.e. 11000010:10100010

compolo
  • It is unclear whether you are trying to read _UTF-8-encoded characters_ (ö) or _Unicode human-readable codepoint notation_ (U+00F6) or _Unicode character names_ (LATIN SMALL LETTER O WITH DIAERESIS). Each requires a completely different algorithm. Please clarify. – zwol Nov 13 '17 at 22:59
  • You are confusing things. There's no such thing as a "UTF-8 character" or a "UTF-16 character". There are only Unicode code points, which are 21 bits wide. You can encode this into a stream of 8-, 16-, or 32-bit ints in a number of ways. UTF-8 is one such encoding. UTF-16 is another, as are UCS-2 and UCS-4. But there is only one Unicode character set--that's the whole point. – Lee Daniel Crocker Nov 13 '17 at 23:00
  • I am trying to read encoded characters. – compolo Nov 13 '17 at 23:04
  • So @LeeDanielCrocker there is one unicode character set, UTF-8, or UTF-16 specifies bit field width? Is that correct? – compolo Nov 13 '17 at 23:07
  • UTF-8, etc., are *encodings*. They are like little languages--ways of taking a sequence of Unicode code points and compressing them into a stream of bytes in such a way that they can be uniquely decoded back to the originals. – Lee Daniel Crocker Nov 13 '17 at 23:20
  • @Barmak, the text file literally has just `¢` in it. – compolo Nov 13 '17 at 23:27
  • None of you are helpful. If you do not know how to solve a problem or answer problems, do not comment. And also, Thank You. – compolo Nov 14 '17 at 04:16
  • You cannot tell from the binary data if it is UTF-8, UTF-16 or UTF-32. You can only try to guess from the input in a two pass way. – Luis Colorado Nov 14 '17 at 09:33

2 Answers


The problem is that fgetc will only read 1 byte, while the UTF-8 encoding of a character like ¢ is two bytes (0xC2 0xA2).

Unh0lys0da
  • So because the bit field of certain characters are longer than a normal char can hold it splits it in two? Well how am I supposed to get a single unicode character? – compolo Nov 13 '17 at 23:00
  • Perhaps this tutorial might proof useful: https://www.cprogramming.com/tutorial/unicode.html – Unh0lys0da Nov 13 '17 at 23:02
  • Unh0lys0da: That tutorial is terrible and wrong in so many ways (e.g., no, a wide character does NOT always store a Unicode code point). – torstenvl Nov 14 '17 at 04:08
  • None of you are helpful. If you do not know how to solve a problem or answer problems, do not comment. Thank you. – compolo Nov 14 '17 at 04:16

You simply can't detect which UTF encoding is used by your data, because each UTF-??? form is just one of several possible encodings of the same Unicode characters.

Fortunately, the encoding can be tagged inside the data itself, but doing so is not mandatory. The BOM (byte order mark) was introduced almost exactly for that purpose (see note below), but you'll find plenty of Unicode documents, in any of the encodings, that don't follow this approach. The same bit pattern means different things in UTF-8, UTF-16 and UTF-32, so the best you can do is search for encoding errors and discard the invalid encodings (possibly all of them) to guess the correct one.

If the document has a BOM at the beginning, that sequence of bytes lets you detect which encoding is in use, as its byte representation differs depending on the actual encoding:

0xef, 0xbb, 0xbf       => UTF-8 (no endianness)
0xfe, 0xff             => UTF-16-BE (big endian)
0xff, 0xfe             => UTF-16-LE (little endian)
0x00, 0x00, 0xfe, 0xff => UTF-32-BE (big endian)
0xff, 0xfe, 0x00, 0x00 => UTF-32-LE (little endian)

But as you can see, the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM, so even this doesn't completely answer your question. For example, a file containing the sequence 0xff, 0xfe, 0x00, 0x00 is a perfectly valid UTF-32-LE file with no data (only the BOM), or a perfectly valid UTF-16-LE file beginning with a NULL (U+0000) character.
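Assuming the file may begin with one of the BOM sequences from the table above, sniffing can be sketched like this, testing the longer UTF-32-LE pattern before the UTF-16-LE one it begins with (sniff_bom is a hypothetical helper, not from any standard library):

```c
#include <string.h>

/* Guess an encoding from a leading BOM. `buf` holds the first `len`
   bytes of the file; returns a static name string, or NULL if no BOM
   is present. Note the UTF-32 patterns are checked before the UTF-16
   ones, since 0xFF 0xFE 0x00 0x00 also starts with the UTF-16-LE BOM. */
const char *sniff_bom(const unsigned char *buf, size_t len) {
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32-BE";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32-LE";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16-BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16-LE";
    return NULL; /* no BOM: the encoding must be known by other means */
}
```

Even this sketch inherits the ambiguity described above: a genuine UTF-16-LE file whose first character is U+0000 would be reported as UTF-32-LE.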

The best approach is to pass the encoding as a parameter to the input routines, so they can decode the data appropriately.

Edit

In the example you use, the character CENT SIGN has the code point U+00A2, i.e. the number 0xA2 (binary 10100010). If you encode this character as UTF-8 you get the two-byte sequence 0xc2, 0xa2; as UTF-16-LE you get 0xa2, 0x00; as UTF-16-BE you get 0x00, 0xa2; as UTF-32-LE you get 0xa2, 0x00, 0x00, 0x00; and as UTF-32-BE you get 0x00, 0x00, 0x00, 0xa2.

The difference is in the unit of encoding. UTF-32 represents every code point as a sequence of 4 bytes (in big-endian or little-endian order). UTF-16 uses 2-byte units; almost all code points fall below U+10000, so almost all fit in a single 16-bit unit, and surrogate pairs are used for those that don't. UTF-8 uses sequences of 1 to 4 bytes.

So the first thing you have to understand is that the Unicode CODE POINT is different from the encoding used to represent it. You cannot determine which encoding was used for a Unicode character by testing the code point (the numerical position of the character in the Unicode table).
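As a concrete check of the byte sequences for U+00A2, here is a sketch of a UTF-8 encoder (encode_utf8 is a hypothetical helper written for this example); feeding it 0xA2 yields the two bytes 0xC2 0xA2:

```c
/* Encode a code point as UTF-8 into `out` (at least 4 bytes).
   Returns the number of bytes written, or 0 for an out-of-range input. */
int encode_utf8(long cp, unsigned char *out) {
    if (cp < 0x80) {                                 /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                                /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));         /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));       /* 10xxxxxx */
        return 2;
    }
    if (cp < 0x10000) {                              /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                             /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond U+10FFFF: not a valid code point */
}
```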

Note

The BOM is an alternate use of the character U+FEFF (ZERO WIDTH NO-BREAK SPACE). When it is put at the beginning of a document, its meaning switches to represent the Byte Order Mark, which is unfortunate, as it forces you to include the character twice in case you actually want to begin a document with it. In practice you will seldom see it as a normal document character. (The byte-swapped value U+FFFE is defined as a noncharacter, which is precisely what makes the BOM usable for detecting byte order; the character substituted when a decoding error occurs is a different one, U+FFFD REPLACEMENT CHARACTER.)

CREDITS

The BOM representation table has been drawn from the Wikipedia page

Luis Colorado
  • My question is, how am I to do this? Syntactically? I understand now the definitions of these things. There is no `BOM` in this file. It can literally have anything in it. 0x00A2 is the CENT SIGN. But for utf8 encoding 0xC2A2 is the CENT SIGN. So that means where it used utf8 before, at any time it can switch up and use utf16. – compolo Nov 14 '17 at 10:10
  • https://en.wikipedia.org/wiki/UTF-8#Description the table in this wikipedia article describes how to distinguish the encodings reasonably. I am interested in knowing how to do something like this with bitwise operations or whatever is necessary – compolo Nov 14 '17 at 10:12
  • The first thing I say is _simply you cannot_. The same bit pattern means different things in different encodings... so the same encoded bit pattern can be decoded as different chars when you apply different decoding routines. See my edit of the answer. – Luis Colorado Nov 14 '17 at 10:15
  • @compolo, better, go to the Unicode consortium to search for Unicode documentation. Download the standard and look for the encoding and the decoding process... and then you'll have the answer I'm trying to explain with examples. **YOU CANNOT GUESS THE ENCODING FROM THE CODE POINT OF A CHARACTER.** You are confounding the numerical code of a Unicode character (the code point) with the byte representation when it is transmitted as bytes (the encoding), and they are different... of course... but you cannot get the encoding from the byte representation. – Luis Colorado Nov 14 '17 at 10:34
  • I cannot get the encoding from byte representation. That is to say that bytes in forms of bit strings or hexadecimal values are irrelevant? – compolo Nov 14 '17 at 10:44
  • I really am just trying to find out how to do multibyte manipulation. It is a bit confounding when you say things like bytes aren't important. Or "I cannot get encoding from the byte representation" – compolo Nov 14 '17 at 10:46
  • Well, this discussion would be better to be held more interactively, like a chat or something like that..... when you talk about code points the idea is that they are numbers... nothing related to byte representation. It's when you encode them when they begin to show those weird things. Encodings are designed to allow to recognize surrogate pairs (in UTF-16) or multibyte sequences (in UTF-8) by going forward and backwards. But you have to know what they are and how to recognize them, to be able to do actual decoding. – Luis Colorado Nov 14 '17 at 11:46
  • Well yes. But I guess I need to be able to do decoding as that is more relevant to what I am trying to do – compolo Nov 14 '17 at 16:15
  • ...but there's no chicken-and-egg problem here... you need to know _how the characters are encoded_ by external means and in advance... then apply that decoding to the incoming data. And not the opposite, which seems to be what you are trying to do. The same byte codes can be parsed by different decoders, giving different code points, so not knowing the encoding can lead to different results, all of them valid. That's the reason I say _no, you can't..._ above. – Luis Colorado Nov 15 '17 at 09:09