First off, C++ has no need of additional string classes. There are probably already hundreds or thousands too many string classes that have been developed, and yours won't improve the situation. Unless you're doing this purely for your edification, you should think long and hard and then decide not to write a new one.
You can use std::basic_string<char>
to hold UTF-8 code unit sequences, std::basic_string<char16_t>
to hold UTF-16 code unit sequences, std::basic_string<char32_t>
to hold UTF-32 code unit sequences, etc. C++ even offers short, handy names for these types: string
, u16string
, and u32string
. basic_string
already solves the problem you're asking about here by offering member functions for copying, comparing, and getting the length of the string that work for any code unit you template it with.
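As a rough illustration of that point, here is a minimal sketch that uses only basic_string members. The helper names (sorts_before, code_unit_count, duplicate) are made up for this example; the point is just that the same template code works for any code unit type.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `a` sorts before `b`. Uses only
// basic_string::compare, so it works for char, char16_t, char32_t, ...
template <typename CharT>
bool sorts_before(const std::basic_string<CharT>& a,
                  const std::basic_string<CharT>& b) {
    return a.compare(b) < 0;
}

// Hypothetical helper: length in code units (not bytes, not characters).
template <typename CharT>
std::size_t code_unit_count(const std::basic_string<CharT>& s) {
    return s.size();
}

// Hypothetical helper: copying is ordinary value semantics.
template <typename CharT>
std::basic_string<CharT> duplicate(const std::basic_string<CharT>& s) {
    return s;
}
```

For example, code_unit_count(std::u16string(u"Hello")) and code_unit_count(std::string("Hello")) both give 5, with no str* function involved.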
I can't think of any good reason for new code that's not interfacing with legacy code to use anything else as its canonical storage type for strings. Even if you do interface with legacy code that uses something else, if the surface area of that interface isn't large you should probably still use one of the standard types and not anything else, and of course if you're interfacing with legacy code you'll be using that legacy type anyway, not writing your own new type.
With that said, the reason you can't use strcmp
, strcpy
, and strlen
for your templated string type is that they all operate on null-terminated byte sequences. If your code unit is larger than one byte, there may be zero bytes before the actual terminating null code unit (assuming you use null termination at all, which you probably shouldn't). Consider the bytes of this UTF-16 representation of the string "Hello" (on a little-endian machine):
48 00 65 00 6c 00 6c 00 6f 00
Since UTF-16 uses 16-bit code units, the character 'H' ends up stored as the two bytes 48 00
. A function that treats the first null byte as the end of the sequence would stop at the second byte of the first code unit and see a string that is one byte long. This clearly will not work.
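To make the failure concrete, here is a small sketch. It stores the bytes shown above in a char array (I've added a two-byte terminating null, which is my assumption and not part of the byte dump), and compares what strlen reports with a loop that counts 16-bit code units. The names byte_strlen and unit_length are invented for this example.

```cpp
#include <cstddef>
#include <cstring>

// The UTF-16LE bytes of "Hello", plus an assumed two-byte terminating null.
const char hello_utf16le[] = {0x48, 0x00, 0x65, 0x00, 0x6c, 0x00,
                              0x6c, 0x00, 0x6f, 0x00, 0x00, 0x00};

// strlen stops at the first zero byte, which is the high byte of 'H'.
std::size_t byte_strlen() {
    return std::strlen(hello_utf16le);
}

// The same text as real 16-bit code units.
const char16_t hello_units[] = u"Hello";

// Counts code units up to the first null code unit.
std::size_t unit_length(const char16_t* s) {
    std::size_t n = 0;
    while (s[n] != u'\0') {
        ++n;
    }
    return n;
}
```

Here byte_strlen() returns 1, while unit_length(hello_units) returns 5.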
So, strcmp
, strcpy
, and strlen
are all specialized versions of algorithms that can be implemented more generally. Since they only work with byte sequences, and you need to work with code unit sequences where the code unit may be larger than a byte, you need generic algorithms that can work with any code unit. The standard library has lots of generic algorithms to offer you. Here are my suggestions for replacing these str*
functions.
strcmp
compares two sequences of code units and returns 0 if the two sequences are equal, a negative value if the first is lexicographically less than the second, and a positive value otherwise. The standard library contains the generic algorithm lexicographical_compare
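which does nearly the same thing, except that it returns true if the first sequence is lexicographically less than the second and false otherwise. Because it only answers "less than", telling "equal" apart from "greater" takes a second call with the arguments swapped.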
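One way to get a strcmp-style three-way result out of lexicographical_compare is to call it twice with the ranges swapped. This is only a sketch; compare_ranges and compare_strings are names I made up, and elements are compared with the code unit type's own operator< (strcmp, by contrast, compares bytes as unsigned char).

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns <0, 0, or >0 like strcmp, but for any two
// iterator ranges whose elements can be compared with operator<.
template <typename It1, typename It2>
int compare_ranges(It1 first1, It1 last1, It2 first2, It2 last2) {
    if (std::lexicographical_compare(first1, last1, first2, last2)) {
        return -1;  // first range sorts before second
    }
    if (std::lexicographical_compare(first2, last2, first1, last1)) {
        return 1;   // second range sorts before first
    }
    return 0;       // neither is less, so they are equal
}

// Convenience wrapper for standard strings of any code unit type.
template <typename CharT>
int compare_strings(const std::basic_string<CharT>& a,
                    const std::basic_string<CharT>& b) {
    return compare_ranges(a.begin(), a.end(), b.begin(), b.end());
}
```

This walks the sequences up to twice. If that matters, std::mismatch can find the first differing position in a single pass.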
strcpy
copies a sequence of code units. You can use the standard library's copy
algorithm instead.
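A minimal sketch of that, assuming the destination is a std::vector sized up front (copy_units is a hypothetical name). Unlike strcpy, std::copy is told exactly where the source ends, so no terminator is needed.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: copies every code unit of `src` into a new buffer.
template <typename CharT>
std::vector<CharT> copy_units(const std::basic_string<CharT>& src) {
    // The destination must already have room for the copied elements.
    std::vector<CharT> dest(src.size());
    std::copy(src.begin(), src.end(), dest.begin());
    return dest;
}
```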
strlen
takes a pointer to a code unit and counts the number of code units before it finds a null value. If you need this function as opposed to one that just tells you the number of code units in the string, you can implement it with the algorithm find
by passing the null value as the value to be found. If instead you want to find the actual length of the sequence, your class should just offer a size
method that directly accesses whatever method your class uses internally to store the size.
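Here is a sketch of the find-based version. std::find needs an end iterator, so this variant takes the buffer capacity as a second argument; that parameter and the name length_before_null are my own choices for the example, which makes it closer to a bounded strnlen than to strlen.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical generic strlen: counts code units before the first null
// code unit, looking at no more than `capacity` units. Returns `capacity`
// if no null code unit is found in that range.
template <typename CharT>
std::size_t length_before_null(const CharT* s, std::size_t capacity) {
    const CharT* last = s + capacity;
    // CharT() is the zero-valued code unit for any character type.
    const CharT* pos = std::find(s, last, CharT());
    return static_cast<std::size_t>(pos - s);
}
```

For a buffer declared as const char16_t buf[] = u"Hello", the call length_before_null(buf, 6) gives 5.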
Unlike the str*
functions, the algorithms I've suggested take two iterators to demarcate code unit sequences; one pointing to the first element in the sequence, and one pointing to the position after the final element of the sequence. The str*
functions only take a pointer to the first element and then assume the sequence continues until the first zero-valued code unit they find. When you're implementing your own templated string class it's best to move away from the explicit null termination convention as well, and just offer an end()
method that provides the correct end point for your string.
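If you do write such a class for your own education, the shape might look something like this. It is a bare sketch, not a recommendation: simple_string is a hypothetical name, and I've assumed a std::vector as the backing store so that the size is tracked without any terminator.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical minimal string class, for illustration only.
template <typename CodeUnit>
class simple_string {
public:
    // Built from an iterator pair, not from a null-terminated pointer.
    simple_string(const CodeUnit* first, const CodeUnit* last)
        : units_(first, last) {}

    const CodeUnit* begin() const { return units_.data(); }
    const CodeUnit* end() const { return units_.data() + units_.size(); }
    std::size_t size() const { return units_.size(); }

private:
    std::vector<CodeUnit> units_;  // stores the length; no terminator needed
};

// Equality via the generic std::equal algorithm over [begin, end).
template <typename CodeUnit>
bool operator==(const simple_string<CodeUnit>& a,
                const simple_string<CodeUnit>& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin());
}
```

Because the end is explicit, a zero-valued code unit in the middle of the data is just another element and does not cut the string short.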