5

As far as I know, Linux uses UTF-8 encoding. Does this mean I can use std::string for handling strings, with the encoding simply being UTF-8?

Now, in UTF-8 some characters are 1 byte and some are 2, 3, or more bytes. My question is: how do you deal with UTF-8 encoded strings on Linux using C++?

In particular: how would you get the length of a string, say in bytes (or in number of characters)? How would you traverse the string? And so on.

The reason I am asking is that, as I said, in UTF-8 a character may be more than one byte, right? So obviously myString[7] and myString[8] might not refer to two different characters. Also, the fact that a UTF-8 string is ten bytes long doesn't say much about its number of characters, right?

Konrad Rudolph
  • If you need random access to characters it's better to convert the string to `wstring`, which is UTF-32 encoded in gcc. – Tobias Brandt Oct 04 '13 at 13:18
  • @TobiasBrandt [**No!**](http://www.utf8everywhere.org/) – Konrad Rudolph Oct 04 '13 at 13:22
  • @Konrad Rudolph: I agree that UTF-8 should be used for transport/storage everywhere. But if you need *random* access, you cannot use UTF-8. UTF-32 is the correct choice in that case. – Tobias Brandt Oct 04 '13 at 13:25
  • @KonradRudolph If they told you to jump off of the cliff, would you ? You are being far too dogmatic. OP wants to make calculations on his strings, not store/share/send them. While we can recommend to use UTF-8 as much as possible, UTF-32 makes perfect sense in this context. –  Oct 04 '13 at 13:26
  • @TobiasBrandt Your argument is true, your conclusion isn’t. The correct answer is: don’t use random access. You do not need it. But even if you somehow did, UTF-32 is not the answer because you still need to handle normalisation, combining characters etc. Just using `wstring` doesn’t solve the issue, it *ignores* it. – Konrad Rudolph Oct 04 '13 at 13:29
  • Plus the fact that nothing guarantees that wchar_t is 32 bits. – john Oct 04 '13 at 13:30
  • @Tibo You imply that I am mindlessly following the article I linked to. I’m not, it just presents the argument in concise form. – Konrad Rudolph Oct 04 '13 at 13:30
  • Can you explain why you need random access to the "characters"? (you may want to provide your definition of "character" for better discourse) That explanation may help in finding a better solution or it may help you realise you don't really need it. – R. Martinho Fernandes Oct 04 '13 at 13:37
  • @R. Martinho Fernandes, maybe he wants to replace some character in utf-8 string? – kvv Oct 04 '13 at 13:41
  • @kvv which one? Usually you search for it first and keep some sort of iterator/index to the right position. You don't go and jump to the "the 3rd character" and replace it. – R. Martinho Fernandes Oct 04 '13 at 13:42
  • @R. Martinho Fernandes: OK, good point. Honestly, I think now I would need to count the number of bytes, and *maybe* also the number of characters. In any case I made a general inquiry, but I see it's not so trivial, so maybe I'll address specific issues later when/if I encounter them (e.g., accessing a random character in the string). –  Oct 04 '13 at 13:44
  • Linux doesn't "use UTF-8 encoding", it does only if you set it to use it — ok, that's probably what most distros do by default nowadays, but anyway. – Skippy le Grand Gourou Oct 04 '13 at 14:18
  • @SkippyleGrandGourou, any distro that doesn't should be taken out back and shot. There's simply too much good that comes from standardizing on UTF-8. I only wish Windows would get the memo. – Mark Ransom Oct 04 '13 at 14:36
  • @MarkRansom I don't know if you've noticed, but MS has been busy bloating the user space and has never changed anything "system" since 1993. They occasionally do some tuning to the scheduler, one has got to admit: once in XP to support hyperthreading, once in Vista to become fairer wrt traps, and once in Win8 to become tickless. Otherwise it's just NT3. – v.oddou Oct 01 '15 at 06:38

5 Answers

6

You cannot handle UTF-8 with std::string. string, despite its name, is only a container for (multi-) bytes. It is not a type for text storage (beyond the fact that a byte buffer can obviously store any object, including text). It doesn’t even store characters (char is a byte, not a character).

You need to venture outside the standard library if you want to actually handle (rather than just store) Unicode characters. Traditionally, this is done by libraries such as ICU.

However, while this is a mature library, its C++ interface sucks. A modern approach is taken in Ogonek. It’s not as well established and is still a work in progress, but it provides a much nicer interface.
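
For illustration, here is a minimal sketch of counting user-perceived characters (grapheme clusters) with ICU's `BreakIterator`. The helper name and test string are just examples, error handling is kept to a bare minimum, and a UTF-8 execution character set is assumed (the default with GCC on Linux). Link against the ICU common library, typically `-licuuc`.

#include <iostream>
#include <memory>
#include <string>

#include <unicode/brkiter.h>   // icu::BreakIterator
#include <unicode/locid.h>     // icu::Locale
#include <unicode/unistr.h>    // icu::UnicodeString

// Count user-perceived characters (grapheme clusters) in a UTF-8 string.
long countGraphemes(const std::string& utf8) {
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);

    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status)) return -1;

    it->setText(text);
    long count = 0;
    it->first();                                   // boundary before the first cluster
    while (it->next() != icu::BreakIterator::DONE) // one step per grapheme cluster
        ++count;
    return count;
}

int main() {
    std::string s = "a\u0308bc";            // "äbc" written with a combining diaeresis
    std::cout << s.size() << " bytes, "     // 5 bytes
              << countGraphemes(s)          // 3 user-perceived characters
              << " user-perceived characters\n";
}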

Glorfindel
Konrad Rudolph
  • Isn't utf-8 a multi-byte encoding? While I understand what you really mean, do not say "cannot handle", because this depends on the context. –  Oct 04 '13 at 13:29
  • @Tibo How does it depend on context? I agree that `std::string` can *carry* UTF-8. But no, it cannot *handle* it. – Konrad Rudolph Oct 04 '13 at 13:31
  • Let's not make this degenerate into a pointless argument about the definition of "handling". Apples here, oranges there. –  Oct 04 '13 at 13:36
  • It can only handle *some* UTF-8 strings, and thus it can't handle UTF-8 in general. – thecoshman Oct 04 '13 at 13:56
  • It really does depend on the context to some extent. Concatenating two strings known to be well-formed should work fine, for example. – Mark Ransom Oct 04 '13 at 14:40
  • @Mark Yes (See comments on deleted answer). But it’s pretty much the only operation that works. So I’d still say that it’s an operation on the container rather than the text content. – Konrad Rudolph Oct 04 '13 at 14:46
  • @KonradRudolph Writing here because I deleted my answer, but byte-oriented searching for UTF-8 does work - it's listed as [one of the advantages](http://en.wikipedia.org/wiki/UTF-8#Advantages_3) of UTF-8 in the wiki. And your example letters "ä" and "ä" are different even in UCS-2 - I checked with a hex editor. They are not equal even as `wchar_t`. – sashoalm Oct 04 '13 at 14:50
  • @sashoalm That they’re not equal in UCS-2 was kind of my point though. UCS-2 suffers from the same problems as UTF-8 (and then some). And the Wikipedia article is simply wrong. – Konrad Rudolph Oct 04 '13 at 14:52
  • @KonradRudolph OK, would you expect 'a' to be equal to 'а'? Hint - the second 'a' is Cyrillic. That they could be visually the same is irrelevant. And I suspect that this `ICU` tool would say that "ä" and "ä" are different, too. – sashoalm Oct 04 '13 at 14:55
  • That byte-oriented search in UTF-8 will *not generate false positives* is its advantage over other legacy encodings. Whether not finding "ä" in "ä" is a false negative depends on what you are doing, but any process that claims conformance with the Unicode Standard must treat the two equivalently. – R. Martinho Fernandes Oct 04 '13 at 14:59
  • @sashoalm See Martinho’s comment. Latin-a and Cyrillic-а are *logically different* characters even if they happen to have the same visual rendition. “ä” (letter-with-diaeresis) and “ä” (letter, combining diaeresis) are *logically identical* even if they are physically stored differently. That’s the difference. And ICU, properly used, will treat them correctly. – Konrad Rudolph Oct 04 '13 at 15:05
  • @sashoalm just FWIW one of the reasons the Cyrillic alphabet is separate, even when some letters are "the same", is that merging them would cause some headaches. Consider a lowercase conversion. Cyrillic ve (В) lowercases to в, but latin B lowercases to b. (And ve only visually matches B; usually it's be (Б) that corresponds to B). 'A' happens to match fine in both alphabets, but I guess they decided to encode them all separately for consistency. – R. Martinho Fernandes Oct 04 '13 at 15:16
  • @R.MartinhoFernandes OK, just to clear things up for me, since “ä” (letter-with-diaeresis) and “ä” (letter, combining diaeresis) are logically the same, does that mean they would be byte-wise same in UCS-4, or in any kind of encoding? Or are they separate symbols in the Unicode table itself, but logically the same? – sashoalm Oct 04 '13 at 15:20
  • @sashoalm First off, forget you ever heard the terms UCS-2 and UCS-4. These things shouldn’t exist any more, they’re legacy and dead. Secondly, UTF-16 and UTF-32 *also* treat them differently, because there are simply different ways of representing combined letters in Unicode as code points. In UTF-32, the only difference to UTF-8 is that every code point is represented by a fixed-width four-byte sequence, whereas UTF-8 uses a variable-width representation for code points. But this says nothing about characters. – Konrad Rudolph Oct 04 '13 at 15:22
  • @sashoalm, if you want logically equivalent strings to also be bitwise equivalent you can [normalize them](http://www.unicode.org/faq/normalization.html). – Mark Ransom Oct 04 '13 at 15:27
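
To make the equivalence discussed in these comments concrete, here is a minimal sketch (not from any of the comments above) using ICU's `Normalizer2`: the precomposed and combining forms of "ä" differ byte-wise but compare equal after NFC normalisation. It assumes a UTF-8 execution character set (the GCC default on Linux); link with `-licuuc`.

#include <iostream>
#include <string>

#include <unicode/normalizer2.h>  // icu::Normalizer2
#include <unicode/unistr.h>       // icu::UnicodeString

int main() {
    // "ä" precomposed (U+00E4) vs. "a" + combining diaeresis (U+0061 U+0308):
    // different bytes, canonically equivalent text.
    std::string precomposed = "\u00E4";
    std::string combining   = "a\u0308";

    std::cout << std::boolalpha
              << "byte-wise equal: " << (precomposed == combining) << '\n';  // false

    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString a = nfc->normalize(icu::UnicodeString::fromUTF8(precomposed), status);
    icu::UnicodeString b = nfc->normalize(icu::UnicodeString::fromUTF8(combining), status);
    std::cout << "equal after NFC:  " << (a == b) << '\n';                   // true
}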
3

You may want to convert the UTF-8 encoded strings to some kind of fixed width encoding prior to manipulating them. But that depends on what you are trying to do.

To get the length in bytes of a UTF-8 string, that's just `str.size()`. Getting the length in characters is slightly more difficult, but you can do it by ignoring any byte in the string whose value is >= 0x80 and < 0xC0. In UTF-8 those values are always trailing (continuation) bytes, so count the bytes in that range and subtract that count from the size of the string.

The above does ignore the issue of combining characters. It does rather depend on what your definition of character is.
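
A minimal sketch of the counting approach described above (the function name is just an example); it counts code points, not user-perceived characters:

#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx, i.e. 0x80..0xBF).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)   // not a continuation byte, so it starts a code point
            ++count;
    return count;
}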

john
  • In particular your counting algorithm will return the number of *codepoints*, not characters. Those of us in the English-speaking world don't often notice the difference. – Mark Ransom Oct 04 '13 at 14:43
  • @john: thanks. But this doc: http://www.cplusplus.com/reference/string/string/size/ says `size` returns number of characters, not bytes, so how would it correctly calculate number of bytes? –  Oct 07 '13 at 13:08
  • @dmcr_code The confusion is because there are different definitions of character. In a UTF-8 string one character can be more than one byte. The reference you quote is assuming that one character is one byte, but that's not true for you. My answer is correct, apart from the point that Mark Ransom mentioned, he's using yet another definition of character. Why don't you write some test code? – john Oct 07 '13 at 13:14
  • @dmcr_code `std::string` has no in built understanding of UTF-8. How could it even know that you are using UTF-8? Therefore `std::string::size` cannot possibly return the number of UTF-8 characters in a string. – john Oct 07 '13 at 13:17
  • @john: When you suggested that I could use str.size() to get the number of bytes of the string, I just said that str.size() returns the number of characters (per std::string::size), so it probably would not return the correct number of bytes for a UTF-8 string, would it? For an ANSI string it would, since the size of a character is one byte. –  Oct 07 '13 at 14:04
  • @dmcr_code `str.size()` returns the number of bytes used in a `std::string` because `sizeof(char) == 1` by definition. – john Oct 07 '13 at 14:46
  • @john: Dear john, that is what I said. I know str.size() will be equal to the size of the string in bytes because sizeof(char) == 1, but will str.size() also give the correct size of the string in bytes in the case of a UTF-8 string? That was my question. –  Oct 07 '13 at 16:38
  • Yes it will; bytes are just bytes. It makes no difference whether your string is UTF-8 or anything else. – john Oct 07 '13 at 21:08
2

There are multiple concepts here:

  1. length of UTF-8 encoding in bytes
  2. number of Unicode code points used (= number of UTF-8 bytes outside the 0x80..0xbf range)
  3. number of glyphs ("characters" in Western languages)
  4. screen space occupied when displaying

Normally, you are only interested in 1. (for memory requirements) and 4. (for display); the others have no real application.

The amount of screen space can be queried from the rendering context. Note that this may change depending on context (for example, Arabic letters change shape at the beginning and end of words), so if you are doing text input, you may need to perform additional trickery to give users a consistent experience.
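
As a rough illustration of points 1 and 4 for terminal output specifically (not GUI rendering, which should query the layout engine as described above), the POSIX `wcswidth()` function reports how many columns a wide-character string occupies. This sketch is not part of the original answer and assumes a UTF-8 locale on a POSIX system:

#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <wchar.h>   // wcswidth() (POSIX, not standard C++)

int main() {
    std::setlocale(LC_ALL, "");              // pick up the environment's (UTF-8) locale

    std::string utf8 = "日本語";              // 9 bytes, 3 code points, 6 terminal columns
    std::cout << utf8.size() << " bytes\n";  // concept 1: length of the encoding in bytes

    // Convert to wide characters so wcswidth() can measure the column width
    // (concept 4, for terminal output only).
    std::vector<wchar_t> wide(utf8.size() + 1);
    std::size_t n = std::mbstowcs(wide.data(), utf8.c_str(), wide.size());
    if (n != static_cast<std::size_t>(-1))
        std::cout << wcswidth(wide.data(), n) << " terminal columns\n";
}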

Simon Richter
  • I don't agree. A proper UTF-8 string should give you access to 1, 2 and 3 because they are all important when dealing with text, while 4 is none of the string's business. – RecursiveExceptionException Oct 23 '17 at 20:08
  • @NullExceptionPointer, 2 and 3 tell me exactly nothing about the string, unless I'm making assumptions about the language used. – Simon Richter Oct 24 '17 at 01:02
  • Fair enough, I was assuming utf-8 encoded unicode but a string may contain anything so that would be quite silly. – RecursiveExceptionException Oct 25 '17 at 16:51
  • @NullExceptionPointer, Unicode can encode so many different things now, for example all the emoji, some of which have additional modifiers, or country codes that are always two Unicode code points, like 🇨🇭, which is encoded as U+1F1E8 U+1F1ED. If you want to know how many glyphs there are, you need a full Unicode table. – Simon Richter Oct 25 '17 at 17:03
1

I'm using the libunistring library, which can help you deal with all of these questions.

For example, here is a simple string-length function (counting UTF-8 characters, i.e. code points):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

#include <uniname.h>   // UNINAME_INVALID (used here only as an "invalid" sentinel)
#include <unistr.h>    // ucs4_t, u8_next
// Link with -lunistring.

// Returns the length of a UTF-8 string in code points, stopping at the first
// malformed sequence.
size_t my_utf8_strlen(const uint8_t *str) {
    if (str == NULL) return 0;
    if ((*str) == 0) return 0;

    size_t length = 0;
    const uint8_t *current = str;
    // Current UTF-8 character (code point).
    ucs4_t ucs_c = UNINAME_INVALID;

    while (current && *current) {
        current = u8_next(&ucs_c, current);
        length++;

        // Broken character: u8_next reports malformed input as U+FFFD.
        if (ucs_c == UNINAME_INVALID || ucs_c == 0xfffd)
            return length - 1;
    }

    return length;
}

// Use case
std::string test;

// Loading some text in `test` variable.
// ...

std::cout << my_utf8_strlen(reinterpret_cast<const uint8_t *>(test.c_str())) << std::endl;
Artem Agasiev
0

You can determine how many bytes each character occupies from the high-order bits of its first byte: see the "Description" table in the [UTF-8 Wikipedia article](https://en.wikipedia.org/wiki/UTF-8#Description).
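
A minimal sketch of that idea (the names and test string are illustrative, and a UTF-8 source/execution character set is assumed, as is the default on Linux): the leading bits of each byte tell you whether it starts a sequence and how long the sequence is, which is enough to traverse the string code point by code point.

#include <cstddef>
#include <iostream>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by lead byte `b`, decoded
// from its high-order bits: 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3,
// 11110xxx -> 4. Returns 0 for a continuation byte (10xxxxxx) or an invalid lead.
std::size_t sequenceLength(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

int main() {
    std::string s = "aß€";   // 1-, 2- and 3-byte sequences (6 bytes total)

    // Traverse the string one code point at a time, printing the byte offset
    // and byte length of each sequence.
    for (std::size_t i = 0; i < s.size(); ) {
        std::size_t len = sequenceLength(static_cast<unsigned char>(s[i]));
        if (len == 0) len = 1;   // skip over malformed input one byte at a time
        std::cout << "code point at byte " << i << ", " << len << " bytes\n";
        i += len;
    }
}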

Kent Munthe Caspersen
  • You can determine what exactly? Without a Unicode character database, it's impossible to know how many characters are in a UTF-8 sequence. – Nikos C. Oct 04 '13 at 13:29
  • To his question: "In particular: how would you get the length of a string, say in bytes (or in number of characters)? How would you traverse the string? And so on.", I recommend that you traverse the bytes one by one. If the first byte you meet is of the form 11110xxx, the next 3 bytes belong to the same character. After seeing those 3 bytes, a new character begins; repeat from there. – Kent Munthe Caspersen Oct 04 '13 at 13:33
  • @NikosC. No need for the character map. It’s true that you only need the first few bits of every byte to determine whether it’s a continuation. That’s enough to figure out the length in code points (that’s still not necessarily the length of the word, though). – Konrad Rudolph Oct 04 '13 at 13:34
  • Doesn't work that way. For example, the string "άβ" might be recognized as being three characters long. In reality, it's only two. That's because "ά" can consist of two characters, `α` and `'`. You can't determine whether a character actually represents part of another just by looking at the bits. You need a database for it. Hence libraries like ICU have been developed, which do the lookup for you. – Nikos C. Oct 04 '13 at 14:00
  • @Nikos what Konrad said is correct, because he was careful enough to pick the accurate words. – R. Martinho Fernandes Oct 04 '13 at 14:02
  • @R.MartinhoFernandes Makes no sense to me. How do the bits make a character belong to a string? A string is just the construct we use to store the sequence in C. – Nikos C. Oct 04 '13 at 14:04
  • Number of bytes != number of codepoints != number of characters – thecoshman Oct 04 '13 at 14:05
  • @NikosC. Yes, I agree and take my objection back. I actually used the exact same argument in a comment on another (now deleted) answer, to argue that you cannot use byte comparisons when searching for strings. It would be inconsistent to now argue the opposite. – Konrad Rudolph Oct 04 '13 at 14:08