So, I'm building a scripting language and one of my goals is convenient string operations. I tried some ideas in C++.
- Strings as sequences of bytes, with free functions that return vectors containing the code-point indices.
- A wrapper class that combines a string and a vector containing the indices.
Both ideas had the same problem: what type should I return for a single code point? It couldn't be a char, and a string would waste space.
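For reference, the first idea (a byte string plus a free function that returns code-point indices) could look roughly like the sketch below. The name code_point_offsets is hypothetical, and the function assumes the input is already valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offset at which each code point starts.
// Assumes `s` is valid UTF-8, where continuation bytes have the form 10xxxxxx.
std::vector<std::size_t> code_point_offsets(std::string const & s) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) { // not a continuation byte: a new code point starts here
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

With the offsets in hand, the n-th code point occupies the bytes from offsets[n] up to offsets[n + 1] (or the end of the string), which is exactly where the "what do I return" question comes up.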
I ended up creating a wrapper class around a char array of exactly 4 bytes: a string that takes exactly 4 bytes in memory, no more, no less.
After creating this class, I felt tempted to wrap a std::vector of it in another class and build from there, thus making a string type of code-points. I don't know if this is a good approach: it would be much more convenient, but it would also waste more space.
So, before posting some code, here's a more organized list of ideas.
- My character type would be neither a byte nor a grapheme, but a code-point. I named it rune, like the one in the Go language.
- A string would be a series of decomposed runes, which makes indexing and slicing O(1).
- Because a rune is now a class and not a primitive, it could be expanded with methods, for example one for detecting Unicode whitespace:
mystring[0].is_whitespace()
- I still don't know how to handle graphemes.
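To make the list above concrete, here is a minimal sketch of a rune string with O(1) indexing and an is_whitespace() method. Everything in it is an assumption for illustration: the struct is a simplified stand-in for my rune class, the names rune_string and code_point are hypothetical, and the decoder assumes well-formed UTF-8 (invalid lead bytes are simply stored as one-byte runes). The whitespace check uses the code points that carry the Unicode White_Space property.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for the rune class: up to 4 UTF-8 bytes, zero-padded.
struct rune {
    char data[4] {};

    // Decodes the stored bytes to a code point. Assumes well-formed UTF-8.
    std::uint32_t code_point() const {
        unsigned char b0 = static_cast<unsigned char>(data[0]);
        // Payload (low 6 bits) of the i-th byte.
        auto cont = [this](int i) {
            return static_cast<std::uint32_t>(static_cast<unsigned char>(data[i]) & 0x3F);
        };
        if (b0 < 0x80) return b0;
        if ((b0 & 0xE0) == 0xC0) return ((b0 & 0x1Fu) << 6) | cont(1);
        if ((b0 & 0xF0) == 0xE0) return ((b0 & 0x0Fu) << 12) | (cont(1) << 6) | cont(2);
        return ((b0 & 0x07u) << 18) | (cont(1) << 12) | (cont(2) << 6) | cont(3);
    }

    // True for code points with the Unicode White_Space property.
    bool is_whitespace() const {
        std::uint32_t c = code_point();
        return (c >= 0x09 && c <= 0x0D) || c == 0x20 || c == 0x85 || c == 0xA0 ||
               c == 0x1680 || (c >= 0x2000 && c <= 0x200A) || c == 0x2028 ||
               c == 0x2029 || c == 0x202F || c == 0x205F || c == 0x3000;
    }
};

// Hypothetical string of runes: one 4-byte slot per code point, so operator[] is O(1).
class rune_string {
    std::vector<rune> runes;
public:
    explicit rune_string(std::string const & utf8) {
        std::size_t i = 0;
        while (i < utf8.size()) {
            unsigned char b = static_cast<unsigned char>(utf8[i]);
            // Sequence length from the lead byte; anything unexpected counts as 1 byte.
            std::size_t n = b < 0x80 ? 1
                          : (b & 0xE0) == 0xC0 ? 2
                          : (b & 0xF0) == 0xE0 ? 3
                          : (b & 0xF8) == 0xF0 ? 4 : 1;
            if (n > utf8.size() - i) n = utf8.size() - i; // truncated input
            rune r;
            for (std::size_t k = 0; k < n; ++k) r.data[k] = utf8[i + k];
            runes.push_back(r);
            i += n;
        }
    }
    std::size_t size() const { return runes.size(); }
    rune const & operator[](std::size_t i) const { return runes[i]; }
};
```

The trade-off is the one I mentioned: every code point costs 4 bytes, the same as UTF-32, even when the text is plain ASCII. Graphemes are not handled here at all.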
Curious fact! An odd thing about the way I built the prototype of the rune class is that it always prints as UTF-8. Because my rune is not an int32 but a 4-byte string, this ends up having some interesting properties.
My code:
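#include <cstddef>
#include <ostream>
#include <string>

class rune {
    char data[4] {};
public:
    rune(char c) {
        data[0] = c;
    }
    // This constructor needs a string, a start position and a byte count (1 to 4)!
    rune(std::string const & s, std::size_t p, std::size_t n) {
        // Guard against writing past data[3] or reading past the end of s;
        // on bad input the rune stays zeroed.
        if (n > 4 || p > s.size() || n > s.size() - p) {
            return;
        }
        for (std::size_t i = 0; i < n; ++i) {
            data[i] = s[p + i];
        }
    }
    void swap(rune & other) {
        rune t = *this;
        *this = other;
        other = t;
    }
    // Output as UTF-8: write the stored bytes, stopping at the first '\0' padding byte.
    friend std::ostream & operator <<(std::ostream & output, rune input) {
        for (std::size_t i = 0; i < 4; ++i) {
            if (input.data[i] == '\0') {
                return output;
            }
            output << input.data[i];
        }
        return output;
    }
};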
Error handling ideas:
I don't like to use exceptions in C++. My idea is: if the constructor fails, initialize the rune as 4 '\0' bytes, then overload operator bool explicitly to return false if the first byte of the rune happens to be '\0'. Simple and easy to use.
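A minimal sketch of that error-handling idea, trimmed down to the relevant parts of the class. What counts as "the constructor fails" is my assumption here: a byte count outside 1 to 4, or a range that runs past the end of the string.

```cpp
#include <cstddef>
#include <string>

class rune {
    char data[4] {};
public:
    // On failure the rune is left as 4 '\0' bytes instead of throwing.
    // Assumed failure conditions: n outside 1..4, or [p, p + n) not inside s.
    rune(std::string const & s, std::size_t p, std::size_t n) {
        if (n == 0 || n > 4 || p > s.size() || n > s.size() - p) {
            return;
        }
        for (std::size_t i = 0; i < n; ++i) {
            data[i] = s[p + i];
        }
    }
    // False for the all-zero "failed" state, true otherwise.
    explicit operator bool() const {
        return data[0] != '\0';
    }
};
```

Usage would then be something like `rune r(s, p, n); if (!r) { /* handle the error */ }`. One consequence of this design: the code point U+0000 is itself encoded as a single '\0' byte, so a rune holding it cannot be told apart from a failed one.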
So, thoughts? Opinions? Different approaches?
Even if my rune string is too much, at least I have a rune type. Small and fast to copy. :)