3

Some time ago I overheard a discussion saying that a templated string class that can hold UTF-8 and UTF-16 should not use strcmp, strcpy, and strlen. From what I recall, you are supposed to use functions from algorithm.h instead, but I do not remember what the implementation looks like or why this is so. Could someone please explain which functions to use instead, how to use them, and why?

An example of the templated string class would be something like:

String<UTF8> utf8String;
String<UTF16> utf16String; 

Here UTF8 is an unsigned char and UTF16 is an unsigned short.

mmurphy
  • the header is named just `algorithm`, not `algorithm.h`. Also if you do much C++ programming, familiarizing yourself with the standard library is helpful so you don't have to ask what's available, and also so you'll be more likely to recognize any random situation where a library algorithm would be perfect. I see a lot of code that should never have been written, but the author just didn't know that the standard library had an algorithm for them. http://en.cppreference.com/w/cpp/algorithm – bames53 Dec 31 '11 at 09:32

3 Answers

6

First off, C++ has no need of additional string classes. There are probably already hundreds or thousands too many string classes that have been developed, and yours won't improve the situation. Unless you're doing this purely for your edification, you should think long and hard and then decide not to write a new one.

You can use std::basic_string<char> to hold UTF-8 code unit sequences, std::basic_string<char16_t> to hold UTF-16 code unit sequences, std::basic_string<char32_t> to hold UTF-32 code unit sequences, etc. C++ even offers short, handy names for these types: string, u16string, and u32string. basic_string already solves the problem you're asking about here by offering member functions for copying, comparing, and getting the length of the string that work for any code unit you template it with.
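As a rough illustration of the point above, the same basic_string member functions work unchanged for any code unit type. The helper names `code_unit_count` and `compare_units` below are made up for this sketch; only `size()` and `compare()` come from the standard library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code units (not characters) in any basic_string.
template <typename CharT>
std::size_t code_unit_count(const std::basic_string<CharT>& s) {
    return s.size();
}

// Hypothetical helper: three-way comparison, code unit by code unit.
// Negative if a orders before b, zero if equal, positive otherwise.
template <typename CharT>
int compare_units(const std::basic_string<CharT>& a,
                  const std::basic_string<CharT>& b) {
    return a.compare(b);
}
```

Calling `code_unit_count(std::u16string(u"Hello"))` and `code_unit_count(std::string("Hello"))` both go through the same template, and copying is ordinary assignment.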

I can't think of any good reason for new code that's not interfacing with legacy code to use anything else as its canonical storage type for strings. Even if you do interface with legacy code that uses something else, if the surface area of that interface isn't large you should probably still use one of the standard types and not anything else, and of course if you're interfacing with legacy code you'll be using that legacy type anyway, not writing your own new type.


With that said, the reason you can't use strcmp, strcpy, and strlen for your templated string type is that they all operate on null terminated byte sequences. If your code unit is larger than one byte then there may be bytes that are zero before the actual terminating null code unit (assuming you use null termination at all, which you probably shouldn't). Consider the bytes of this UTF-16 representation of the string "Hello" (on a little endian machine).

48 00 65 00 6c 00 6c 00  6f 00

Since UTF-16 uses 16 bit code units, the character 'H' ends up stored as the two bytes 48 00. A function operating on the above sequence of bytes by assuming the first null byte is the end would assume that the second half of the first character marks the end of the whole string. This clearly will not work.
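The failure can be reproduced with a small sketch. The byte array mirrors the dump above with a 16-bit null code unit appended; the names `hello_utf16le`, `length_seen_by_strlen`, and `count_code_units` are invented for this example.

```cpp
#include <cstddef>
#include <cstring>

// The bytes of "Hello" in little-endian UTF-16, followed by a 16-bit null code unit.
const char hello_utf16le[] = {
    0x48, 0x00, 0x65, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x00, 0x00
};

// strlen stops at the first zero byte, which is the second half of 'H'.
std::size_t length_seen_by_strlen() {
    return std::strlen(hello_utf16le);
}

// Counting whole code units up to the first zero-valued code unit instead.
template <typename CodeUnit>
std::size_t count_code_units(const CodeUnit* s) {
    std::size_t n = 0;
    while (s[n] != CodeUnit()) ++n;
    return n;
}
```

Here `length_seen_by_strlen()` reports 1 byte, while `count_code_units(u"Hello")` walks the string one 16-bit unit at a time and reports 5.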

So, strcmp, strcpy, and strlen are all specialized versions of algorithms that can be implemented more generally. Since they only work with byte sequences, and you need to work with code unit sequences where the code unit may be larger than a byte, you need generic algorithms that can work with any code unit. The standard library has lots of generic algorithms to offer you. Here are my suggestions for replacing these str* functions.

strcmp compares two sequences of code units and returns 0 if the two sequences are equal, negative if the first is lexicographically less than the second, and positive otherwise. The standard library contains the generic algorithm lexicographical_compare, which does nearly the same thing, except that it returns true if the first sequence is lexicographically less than the second and false otherwise.
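One way to recover a three-way, strcmp-style result is to call lexicographical_compare twice, once in each direction. This is only a sketch; `compare_ranges` is a hypothetical name, and it returns a negative value when the first range orders before the second, which is strcmp's convention.

```cpp
#include <algorithm>

// Hypothetical strcmp-like comparison over two iterator ranges.
// Returns -1 if [first1, last1) orders before [first2, last2),
// 1 if it orders after, and 0 if the ranges are equal.
template <typename Iter>
int compare_ranges(Iter first1, Iter last1, Iter first2, Iter last2) {
    if (std::lexicographical_compare(first1, last1, first2, last2)) return -1;
    if (std::lexicographical_compare(first2, last2, first1, last1)) return 1;
    return 0;
}
```

Two passes are not the most efficient approach (std::mismatch could do it in one), but it shows the relationship between the boolean result and strcmp's integer result.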

strcpy copies a sequence of code units. You can use the standard library's copy algorithm instead.
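A minimal sketch of that replacement, assuming the caller already knows where the source ends and has a large enough destination buffer. `copy_units` is a made-up wrapper; std::copy is the real algorithm.

```cpp
#include <algorithm>

// Hypothetical strcpy-like helper: copies [first, last) to dest and
// returns a pointer one past the last code unit written.
// The caller must make sure dest has room for (last - first) code units.
template <typename CodeUnit>
CodeUnit* copy_units(const CodeUnit* first, const CodeUnit* last, CodeUnit* dest) {
    return std::copy(first, last, dest);
}
```

Unlike strcpy, nothing here looks for a terminator, so zero-valued code units in the middle of the range are copied like any other.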

strlen takes a pointer to a code unit and counts the number of code units before it finds a null value. If you need this function as opposed to one that just tells you the number of code units in the string, you can implement it with the algorithm find by passing the null value as the value to be found. If instead you want to find the actual length of the sequence, your class should just offer a size method that directly accesses whatever method your class uses internally to store the size.
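A sketch of the find-based version follows. Because std::find needs an end iterator, this variant takes the end of the buffer as an upper bound, which makes it closer to a bounded strlen; the name `bounded_length` is an assumption of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical strlen-like helper: counts code units in [first, last)
// before the first zero-valued code unit. If no zero is found,
// the whole range is counted.
template <typename CodeUnit>
std::size_t bounded_length(const CodeUnit* first, const CodeUnit* last) {
    const CodeUnit* terminator = std::find(first, last, CodeUnit());
    return static_cast<std::size_t>(std::distance(first, terminator));
}
```

For a buffer `char16_t buf[8] = u"Hello"`, `bounded_length(buf, buf + 8)` counts the five code units before the first null code unit.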

Unlike the str* functions, the algorithms I've suggested take two iterators to demarcate code unit sequences: one pointing to the first element in the sequence, and one pointing to the position after the final element of the sequence. The str* functions only take a pointer to the first element and then assume the sequence continues until the first zero-valued code unit they find. When you're implementing your own templated string class it's best to move away from the explicit null termination convention as well, and just offer an end() method that provides the correct end point for your string.
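If you do write such a class for your own edification, the iterator-pair idea could look roughly like this. Everything here is a hypothetical sketch: it stores the code units in a std::vector, keeps no null terminator, and exposes begin(), end(), and size() so the generic algorithms can be applied directly.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical minimal string: the size is stored, not inferred from a terminator.
template <typename CodeUnit>
class String {
public:
    String(const CodeUnit* first, const CodeUnit* last) : units_(first, last) {}

    const CodeUnit* begin() const { return units_.data(); }
    const CodeUnit* end() const { return units_.data() + units_.size(); }
    std::size_t size() const { return units_.size(); }

private:
    std::vector<CodeUnit> units_;
};

// Equality built from the generic algorithm std::equal.
template <typename CodeUnit>
bool operator==(const String<CodeUnit>& a, const String<CodeUnit>& b) {
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}
```

Because the end point comes from end() rather than from a scan for zero, a string containing a zero-valued code unit in the middle still reports its full size.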

bames53
  • IMO that's better left to non-member functions. I think a good model for implementing Unicode support could be iterator adapters. You might have one that takes a UTF-8 code unit iterator (e.g. from a `std::string::begin()`) and acts as a code point iterator. Then you could have an iterator adapter that implements the Unicode glyph boundary algorithm by taking a code point iterator and lets you iterate over glyphs. Algorithms that need to work on code points or glyphs would then just take the appropriately adapted iterator. There really is no need for more string classes. – bames53 Dec 31 '11 at 09:49
  • I think a container adaptor like priority_queue wouldn't really count as a whole new string class. It would be less offensive than a whole new string class, though it does still exhibit some of the same problems. – bames53 Dec 31 '11 at 10:33
  • @DietmarKühl And frankly I'm not sure how much it could protect the user anyway. E.g. if it allows code point oriented access and the user tries to do something that should be done at the level of user perceived characters they'll be able to do the wrong thing. I think on balance I'd prefer a standard string type to whatever small safeties a custom type might offer. – bames53 Dec 31 '11 at 10:45
2

The reason you can't use strcmp, strcpy, or strlen is that they operate on strings whose length is indicated by a terminating zero byte. Since your strings may contain zero bytes inside them, you can't use these functions.

I would just code exactly what you want. What you want depends on what you're trying to do.

David Schwartz
1

In UTF16, you may see bytes that are equal to '\0' in the middle of the string; strcmp, strcpy, and strlen will return incorrect results for strings like that, because they operate under the assumption that strings are zero-terminated.

You can use copy, equal, and distance from the STL to copy, compare, and calculate length based on template-based iterators.
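For the comparison and length parts, a sketch along these lines might help. `range_length` and `ranges_equal` are names invented here; std::distance and std::equal are the standard algorithms being wrapped.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical length helper: number of code units between two iterators.
template <typename Iter>
std::size_t range_length(Iter first, Iter last) {
    return static_cast<std::size_t>(std::distance(first, last));
}

// Hypothetical equality helper: the three-iterator form of std::equal
// assumes the second range is at least as long as the first,
// so the lengths are checked before calling it.
template <typename Iter1, typename Iter2>
bool ranges_equal(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2) {
    return std::distance(first1, last1) == std::distance(first2, last2)
        && std::equal(first1, last1, first2);
}
```

Both helpers work on pointers into a raw buffer as well as on iterators from a standard container such as std::u16string.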

Sergey Kalinichenko
  • While I see std::copy, I do not see a function for compare and calculate. Can you please elaborate on this as mentioned in the original question? – mmurphy Dec 31 '11 at 06:34
  • @mmurphy Use [`lexicographical_compare`](http://www.sgi.com/tech/stl/lexicographical_compare.html) to compare lexicographically or [`equal`](http://www.sgi.com/tech/stl/equal.html) to compare for equality. Use [`distance`](http://www.sgi.com/tech/stl/distance.html) to calculate the number of items between two iterators pointing at the beginning and at the end of your string. – Sergey Kalinichenko Dec 31 '11 at 06:44