
So, I'm building a scripting language and one of my goals is convenient string operations. I tried some ideas in C++.

  • Strings as sequences of bytes, plus free functions that return vectors containing the code-point indices.
  • A wrapper class that combines a string and a vector containing the indices.

Both ideas shared a problem: what should indexing return? It couldn't be a char, and returning a std::string would waste space.

I ended up creating a wrapper class around a char array of exactly 4 bytes: a string that occupies exactly 4 bytes in memory, no more, no less.

After creating this class, I was tempted to wrap a std::vector of it in another class and build from there, thus making a string type of code points. I don't know if this is a good approach; it would be much more convenient, but it would also waste more space.

So, before posting some code, here's a more organized list of ideas.

  • My character type would be neither a byte nor a grapheme, but a code point. I named it rune, like the one in the Go language.
  • A string would be a series of decomposed runes, making indexing and slicing O(1).
  • Because a rune is now a class and not a primitive, it can be extended with methods such as detecting Unicode whitespace: mystring[0].is_whitespace()
  • I still don't know how to handle graphemes.

Curious fact! An odd property of the way I built the prototype of the rune class is that it always prints as UTF-8. Because my rune is not an int32 but a 4-byte string, it ends up having some interesting properties.

My code:

#include <algorithm>
#include <cstddef>
#include <ostream>
#include <string>

class rune {
    char data[4] {}; // zero-initialized, so unused bytes stay '\0'
public:
    rune(char c) {
        data[0] = c;
    }

    // This constructor needs a string, a position and a byte count (at most 4)!
    rune(std::string const & s, size_t p, size_t n) {
        for (size_t i = 0; i < n && i < 4; ++i) {
            data[i] = s[p + i];
        }
    }

    void swap(rune & other) {
        std::swap(*this, other);
    }

    // Output as UTF-8!
    friend std::ostream & operator <<(std::ostream & output, rune input) {
        for (size_t i = 0; i < 4; ++i) {
            if (input.data[i] == '\0') {
                return output;
            }
            output << input.data[i];
        }
        return output;
    }
};

Error handling ideas:

I don't like using exceptions in C++. My idea is: if the constructor fails, initialize the rune as four '\0' bytes, then overload the bool operator explicitly to return false if the first byte of the rune happens to be '\0'. Simple and easy to use.

So, thoughts? Opinions? Different approaches?

Even if my rune string is too much, at least I have a rune type. Small and fast to copy. :)

João Pires
  • Why not use a `char32_t` to store "runes"? – Kerrek SB Jan 17 '17 at 16:29
  • How do I use it? Last time I checked, there wasn't much information about it. – João Pires Jan 17 '17 at 16:30
  • @DagobertoPires: "*Last time I checked*" Googling "char32_t C++" leads to plenty of information about it. The second link was an SO question. – Nicol Bolas Jan 17 '17 at 16:35
  • How do I return it from a function? I believe that the main advantage of my idea is that it's just a 4-byte string, thus easier to reason about. Also, I know exactly how it's implemented. – João Pires Jan 17 '17 at 16:38
  • "I don't like to use exceptions in C++". You will learn. – n. m. could be an AI Jan 17 '17 at 16:38
  • This could be a worthwhile read for you: http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ – davmac Jan 17 '17 at 16:42
  • I've already read that and countless other articles on the subject. There need to be some compromises, and a code point is not only fixed-width but also the most similar thing to a character that we have now. All the other solutions, like iterating or getting a vector of indices, are not as convenient. The keyword here is convenience. I want to make a string where everything is abstracted away and where you just count characters like those in real life. Ideally this would end up being a string of grapheme clusters. – João Pires Jan 17 '17 at 16:51
  • Your code has several problems as is: uninitialized `data[1] .. [3]` on construction from `char`, and no checking for UTF-8 validity whatsoever. It will acquire more problems as you start implementing `is_whitespace()` and the like, which are harder than you probably think. If you like to tinker and build stuff, go ahead, no one is going to stop you, but if you want a working Unicode implementation, use an existing one. – n. m. could be an AI Jan 17 '17 at 16:52
  • @davmac damn straight. Codepoints do have meaning of course... they mean whatever the Unicode standard says they mean, but this is all too often not what programmers think they mean. 99% of the time people should be using UTF-8 strings as atomic entities. – n. m. could be an AI Jan 17 '17 at 16:59
  • The question is, why do you want to count "characters" (whatever that means) in a string, or break it up into individual "characters"? Do you have a legitimate use case for that? There are surprisingly few out there. – n. m. could be an AI Jan 17 '17 at 17:03
  • In my scripting language I want all the containers to behave similarly. Length seems to me something trivial to have. Convenience too. It just feels more natural for us humans. – João Pires Jan 17 '17 at 17:16
  • Is it natural that `length("ã")` returns 2 in some cases, 1 in others? That's what codepoints do. Why do you view strings as containers of anything anyway? For most users strings should be atomic. – n. m. could be an AI Jan 17 '17 at 17:25
  • All the strings would first be normalized so that they could be stored in a human-comprehensible way. The idea is that `ã` and `ñ` would always be one code point. – João Pires Jan 17 '17 at 17:32
  • Not all characters have a normalized representation. So `á` will be one codepoint if it's a Latin a with acute, but two codepoints when it's a Cyrillic a with a stress sign. You cannot distinguish the two visually. Is that natural? Don't let me start talking about Hebrew or Devanagari scripts. You are still not saying *why* you would need all of this. Why not treat strings as atomic? No length, no characters. – n. m. could be an AI Jan 17 '17 at 17:44
  • Because in my dreams I want a 100% abstracted string, I tell it to give me a character and it gives me a character as we humans perceive them. I really want to make a high level scripting language with the highest level possible of string with the most user friendly and easy to use interface possible. I don't want users to think in bytes or in code-points, I just want them to think in what we naturally perceive as characters and let the language itself deal with all the backstage stuff. – João Pires Jan 17 '17 at 17:49
  • Ok, I quit, I'll just make everything bytes, and give users `nth`, `iterators`, and `vectors of indices` and let them suffer. All the default string functions like slicing and indexing will then be in bytes. – João Pires Jan 17 '17 at 17:58
  • "in my dreams I want a 100% abstracted string" is a very noble dream. Do you think you will be able to achieve it alone? You are at the start of your journey. There are a lot of libraries that operate with code points. The only (only!) thing you do differently (not necessarily better, just differently) is encoding. You store UTF-8, they store UCS-4. And because of this you will have to reimplement the tens of thousands of man-hours' worth of coding they have accumulated. Good luck, but permit me not to hold my breath. – n. m. could be an AI Jan 17 '17 at 18:10
  • "it gives me a character as we humans perceive them". There is a whole lot of these pesky humans. And they perceive things *differently*. Even the same human in different situations. How dare they! – n. m. could be an AI Jan 17 '17 at 18:12

1 Answer


It sounds like you're trying to reinvent the wheel.

There are, of course, two ways you need to think about text:

  • As an array of codepoints
  • As an encoded array of bytes.

In some codebases, those two representations are the same (and all encodings are basically arrays of char32_t or unsigned int). In some (I'm inclined to say "most" but don't quote me on that), the encoded array of bytes will use UTF-8, where the codepoints are converted into variable lengths of bytes before being placed into the data structure.

And of course many codebases simply ignore Unicode entirely and store their data as ASCII. I don't recommend that.

For your purposes, while it does make sense to write a class to "wrap around" your data (though I wouldn't call it a rune, I'd probably just call it a codepoint), you'll want to think about your semantics.

  • You can (and probably should) treat all std::strings as UTF-8 encoded, and prefer this as your default interface for dealing with text. It's safe for most external interfaces (the only time it will fail is when interfacing with UTF-16 input, and you can write special cases for that) and it'll save you the most memory, while still obeying common string conventions (it's lexicographically comparable, which is the big one).
  • If you need to work with your data in code-point form, then you'll want to write a struct (or class) called codepoint, with the following useful functions and constructors:
    • While I have had to write code that handles text in codepoint form (notably for a font renderer), this is probably not how you should store your text. Storing text as codepoints leads to problems later on when you're constantly comparing against UTF-8 or ASCII encoded strings.

code:

#include <string>

struct codepoint {
    char32_t val;
    codepoint(char32_t _val = 0) : val(_val) {}
    codepoint(std::string const& s);
    codepoint(std::string::const_iterator begin, std::string::const_iterator end);
    //I don't know the UTF-8→codepoint conversion off-hand. There are lots of places
    //online that show how to do this

    std::string to_utf8() const;
    //Again, look up an algorithm. They're not *too* complicated.
    void append_to_string_as_utf8(std::string & s) const;
    //This might be more performant if you're trying to reduce how many dynamic memory 
    //allocations you're making.

    //codepoint(std::wstring const& s);
    //std::wstring to_utf16() const;
    //void append_to_string_as_utf16(std::wstring & s) const;

    //Anything else you need, equality operator, comparison operator, etc.
};
Xirema
  • "You can (and probably should) treat all `std::string`'s as UTF-8 encoded strings." There is that little-known OS called Windows out there... – n. m. could be an AI Jan 17 '17 at 17:00
  • @n.m. I'd still recommend UTF-8 strings as the default storage medium, and wrap around any WinOS system calls with UTF-16→UTF-8 or UTF-8→UTF-16 conversions. It's the least error-prone solution. – Xirema Jan 17 '17 at 17:08
  • The thing is, I suppose that char32_t is implemented as an alias to an integer. My class is just an array, and it behaves like an array. If I want to store a special whitespace character to later compare against, I can just initialize it with `"\xe3\x80\x80"` and then do the comparisons. The trick is in converting a `std::string` to a `std::vector`. Validation needs to be done, obviously, but in most cases it should work all right. All I need to do is get the code-point indices and then split the string into a `std::vector`. Plus, it prints in UTF-8! – João Pires Jan 17 '17 at 17:14