I wonder if it is possible to subclass std::string to create a new type with the same behavior but a different meaning, for example a UTF-8 string. I am thinking about something like Django's safestring or Nim's distinct subtypes.

I would like this type to express "this string has been created through one of my UTF-8 producing functions and is guaranteed to be valid UTF-8 (and not some other encoding)". Then the type system could prevent me from accidentally mixing encodings.

UTF-8 is just an example, it can be any other "distinct string" - e.g. "user input", "natural language" as opposed to machine readable keys, and so on. I am not trying to enforce e.g. UTF-8 validity, just origin and "flavor" (type/kind/variant) of my string. I just want to be able to write the following, and have it fail if aString is not a MyString:

void processString(MyString str);
// ...
processString(aString);

I have read the other discussions about subclassing std::string, but am not sure what the conclusion would be for this very specific case. My class would not have additional fields, so slicing would not be a problem, and it would not need to override methods, so it should be OK that none of std::string's methods are virtual. Is there anything I have to define in my subclass for this to work as I want it?

jdm
  • To answer the only two questions you asked: yes, it is possible; and you will have to define all methods that modify the string. Also, the absence of any additional fields does not eliminate the object slicing issue, because your class, unless it privately inherits from `std::string` can still be trivially sliced into a `std::string`. So you have to privately inherit from `std::string`. And then, in addition to implementing all methods that modify the contained string, also publicly inherit all `const` string methods. The real question you're asking here: yes, it is a lot of work. – Sam Varshavchik Jul 20 '18 at 21:54
  • how are you going to implement `operator[]` of UTF-8 string? – Andriy Tylychko Jul 20 '18 at 21:57
  • @AndriyTylychko: Just like a regular `std::string`, bytewise. I am not going after different behavior, I just want to mark some objects as "guaranteed UTF-8". Providing functions that are aware of grapheme clusters etc. is out of scope and would be a whole different project. – jdm Jul 20 '18 at 22:06
  • @SamVarshavchik: How is slicing a problem if I inherit publicly but have no additional fields and no virtual functions? Shouldn't the memory layout be identical? – jdm Jul 20 '18 at 22:09
  • 2
    An identical memory layout means utterly nothing, whatsoever. Object slicing is not about memory layout. There's nothing to prevent assigning an instance of your subclass to a `string &`, and even if your object enforces UTF-8 correctness in its methods, now that you have a `string &`, you can dump any non-UTF-8 content into it, directly. – Sam Varshavchik Jul 20 '18 at 22:13
  • Sure, but this is C++. I could just cast `str[0]` to `(void *)` and change the bits as I please, there is nothing stopping me from putting non-valid UTF8 in there. I don't have that expectation. I just want to store one bit - "yup, I tell you this is UTF8" - into the type. After that, it is all "consenting adults" of course. – jdm Jul 20 '18 at 22:24
  • @jdm That is both (possibly) UB and explicit casting; the `string&` is both defined behaviour and implicit. – Yakk - Adam Nevraumont Jul 20 '18 at 23:41
  • It is not usually considered advisable to inherit from a C++ standard type (like `std::string`) unless that type is designed to be used as a base class - which `std::string` is not. An alternative is to provide a wrapper - a class that *contains* a `std::string` and provides all needed operations (member functions, operators, etc) that do necessary checking and work directly with the contained string. That way the wrapper can ensure the contained `std::string` is only ever encoded using UTF-8 as needed. And the user of the class never needs to interact directly with the contained string. – Peter Jul 21 '18 at 01:07
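
The concern raised in the comments above can be illustrated with a minimal sketch. The names `appendRaw` and `sliceDemo` are hypothetical and exist only for this illustration; `MyString` is the public subclass the question proposes, here assumed to inherit std::string's constructors. The sketch suggests that the check the question asks for does hold (a plain std::string is rejected where a MyString is expected), but that a MyString still binds to a `std::string&` without any cast.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Public subclass with no extra members, as proposed in the question.
class MyString : public std::string {
public:
    using std::string::string;  // inherit std::string's constructors
};

// Hypothetical function that knows nothing about MyString.
// The byte 0xFF never occurs in valid UTF-8.
void appendRaw(std::string& s) { s.push_back('\xFF'); }

// The derived-to-base conversion is implicit...
static_assert(std::is_convertible<MyString&, std::string&>::value,
              "MyString& binds to std::string& without a cast");
// ...while a plain std::string does not convert to MyString.
static_assert(!std::is_convertible<std::string, MyString>::value,
              "std::string is rejected where MyString is expected");

// Hypothetical helper: returns the byte count after the "marked" string
// has been modified through a std::string&.
std::size_t sliceDemo() {
    MyString s("abc");
    appendRaw(s);  // compiles silently; s now holds a non-UTF-8 byte
    return s.size();
}
```

Whether this matters depends on how strict the marker is meant to be: for a pure "consenting adults" tag it may be acceptable, for an invariant it is not.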

1 Answer

Standard library containers were not designed for inheritance; they were designed to be simple and fast. Some string methods are also not very useful for a UTF-8 string and can give surprising results: size() counts bytes rather than characters, and operator[] returns a single byte, which may be only part of a multi-byte sequence. It makes sense to separate the two concepts, and this is easily achieved by aggregation:

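#include <string>

class UTF8_string
{
public:
    // UTF-8 specific functionality

    // Read-only access to the underlying bytes; callers cannot modify them.
    std::string const& byte_string() const { return content; }
private:
    std::string content;  // the wrapped byte sequence
};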

In this case, when the user calls utf8_string.byte_string().size(), the intent is obvious: they are asking for the size in bytes.
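
To connect this back to the question's processString example, here is a minimal sketch of how the wrapper might be completed. The explicit constructor, the `make_utf8` producer, and the return type of `processString` are assumptions added for illustration; they are not part of the class above or of the question.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

class UTF8_string
{
public:
    // Assumption: an explicit constructor, so that a plain std::string
    // never converts to UTF8_string silently.
    explicit UTF8_string(std::string bytes) : content(std::move(bytes)) {}

    std::string const& byte_string() const { return content; }
private:
    std::string content;
};

// Hypothetical "UTF-8 producing function"; a real one would decode or
// validate its input. Here it only tags the bytes.
UTF8_string make_utf8(std::string bytes)
{
    return UTF8_string(std::move(bytes));
}

// Variant of the question's processString that returns the byte count
// so the result can be checked.
std::size_t processString(UTF8_string const& str)
{
    return str.byte_string().size();
}

// Passing a std::string to processString does not compile...
static_assert(!std::is_convertible<std::string, UTF8_string>::value,
              "std::string must be converted explicitly");
// ...and the wrapper cannot be bound to a std::string& by accident.
static_assert(!std::is_convertible<UTF8_string&, std::string&>::value,
              "no implicit access to a mutable std::string");
```

With this shape, `processString(make_utf8("abc"))` compiles, while `processString(std::string("abc"))` is rejected at compile time. The cost is that every std::string operation the callers need has to be forwarded by hand or reached through byte_string().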

Andriy Tylychko
  • Why do you have a non-const `byte_string()` return by reference? That allows outside code to modify the internal `std::string` into whatever they want that is not UTF-8. It would make more sense to have `byte_string()` return by value or const reference instead – Remy Lebeau Jul 20 '18 at 23:18
  • @RemyLebeau: yeah, just `const` version is enough. Thanks, edited – Andriy Tylychko Jul 21 '18 at 15:27