
I have been wondering about the rationale behind the design of std::string's substr(pos, len) method for a while now. It still does not make sense to me, so I decided to ask the experts. The function throws a std::out_of_range exception if the pos argument exceeds the string length plus one. This can be inconvenient (even annoying) at times, but my real concern is consistency and the principle of least surprise. It turns out that the "end" position pos+len of the substring is allowed to exceed the string length plus one. Disallowing this for the beginning but not for the end feels inconsistent to me. Allowing it for the end to me hints at the interpretation

return all characters at positions pos <= i < pos+len

however, then I would expect the function to return an empty string for values of pos exceeding the string length, instead of throwing an exception. As a side note, with this interpretation it would even be sensible to allow for negative values of pos (provided it had a signed type).
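For reference, the boundary cases described above can be checked with a small sketch. The helper names `substr_throws` and `check_substr_boundaries` are made up for illustration. The checks rely on the standard's rule that `substr` throws `std::out_of_range` when `pos > size()` and otherwise clamps the length to `size() - pos`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if s.substr(pos, len) throws std::out_of_range.
bool substr_throws(const std::string& s, std::size_t pos, std::size_t len) {
    try {
        (void)s.substr(pos, len);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Walks through the boundary cases for a 5-character string.
void check_substr_boundaries() {
    const std::string s = "hello";                    // size() == 5
    assert(s.substr(3, 100) == "lo");                 // len is clamped to size() - pos
    assert(s.substr(5, 100).empty());                 // pos == size() is allowed: empty result
    assert(substr_throws(s, 6, 0));                   // pos > size() throws, even with len == 0
    assert(!substr_throws(s, 0, std::string::npos));  // an oversized len alone never throws
}
```

So the start position is checked strictly, while the end position is silently clamped.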

This leaves me with the following questions:

  • Does this design appear logical to you? Sensible? Do you have a satisfactory way to resolve the inconsistency? The only possible explanation I can come up with is compatibility with null-terminated strings. With null termination it does not matter if the specified length exceeds the end, while starting beyond the null character is a memory bug. However, std::string is not null-terminated and instead keeps track of the length of the string. If that's the true reason then personally I'd call that a very bad one.
  • Is there an advantage in terms of performance? I would actually be surprised.
  • Am I overlooking an advantage in terms of usability? Maybe a standard idiom or use case in conjunction with other functions, like find? Here, too, my impression is that returning an empty string has the potential to simplify some code.
  • Is there any way to change the behavior of substr in the future? I guess not, since silently breaking existing code is much worse than living with this twist...?
tglas
  • It makes it easy to say "substring from `pos` to the end", which isn't an uncommon operation. Think of the second argument as a cap on the number of characters in the returned substring. – T.C. Jul 13 '16 at 19:14
  • This question is probably too opinion based. You present an interesting view of a string being like an infinitely sized virtual character buffer from which you can select a slice, even from outside the physical range. I think, though, more subtle errors would go undetected. By throwing a range exception you at least trap some badly calculated position values. Maybe that's the reason? – Galik Jul 13 '16 at 19:17
  • I've always said it, throwing an exception on contract violation (the pre-condition here) is bad design, even if the exception is part of your contract. What is the caller gonna do, catch the exception and call the function in a loop with `pos/=2` until it stops throwing? Better to just assert-fail with a useful message (ex: "Sanitize your input!"). But, STL was written many, many years ago, and maybe people didn't know better back then. Now, we have to live with that behavior. – KABoissonneault Jul 13 '16 at 19:20
  • @Galik The first (and main) question is indeed opinion-based. The others are not. I see the point of undetected errors, but doesn't that also apply to the end position? We simply got used to the fact that it can exceed the size, but that's not particularly safe. I am happy with checking both ends, that's a consistent alternative. – tglas Jul 13 '16 at 19:22
  • Where do you get that `substr` throws if `pos > length + 1`? The standard says *Throws:* **out_of_range** if `pos > size()`. – NathanOliver Jul 13 '16 at 19:34
  • Why do you think that `std::string` is _not_ null-terminated in its implementation? – Ternvein Jul 13 '16 at 19:43
  • @NathanOliver Sorry, my formulation was unclear. It throws if `pos >= length + 1`, which is the same as `pos > length`. My point is that it throws at all. – tglas Jul 13 '16 at 19:53
  • @Ternvein I know that it is null-terminated in the implementation to make `c_str` work without copying, but I think that at least in C++03 you cannot rely on that. The real point is that `'\0'` is a legal character in a `std::string` since the length is *not* determined by null-termination. Therefore I think that an interface that relies on null-termination is unnatural for `std::string`. – tglas Jul 13 '16 at 19:56
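Two points from the comments above can be illustrated with a short sketch: the one-argument form of `substr` means "from `pos` to the end", and `'\0'` is an ordinary character whose presence does not affect `size()`. The function names `with_embedded_null`, `tail`, and `check_embedded_null` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a 3-character string whose middle character is '\0'.
std::string with_embedded_null() {
    return std::string("a\0b", 3);  // the explicit length keeps the '\0' and the 'b'
}

// "From pos to the end": the second argument defaults to npos.
std::string tail(const std::string& s, std::size_t pos) {
    return s.substr(pos);  // same as s.substr(pos, std::string::npos)
}

// The length is tracked explicitly, so the '\0' does not end the string.
void check_embedded_null() {
    const std::string s = with_embedded_null();
    assert(s.size() == 3);
    assert(s[1] == '\0');
    assert(tail(s, 1).size() == 2);  // contains '\0' followed by 'b'
    assert(tail("hello", 2) == "llo");
}
```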

1 Answer


This question is really too opinion-based, but I will try to answer it point by point.

  • Does this design appear logical to you? Sensible? It seems logical to me. Maybe this opinion comes from being used to strncmp-style functions, but with this design you can simply pass your buffer length as the len parameter and it will work fine. If, however, you are trying to access a substring located outside the string's boundaries, then you probably missed some simple sanity checks. The internal implementation of std::string doesn't matter here.
  • Is there an advantage in terms of performance? I think that's not the reason.
  • Am I overlooking an advantage in terms of usability? Maybe, look at point 1.
  • Is there any way to change the behavior of substr in the future? Throwing an exception when pos exceeds size() is defined in the standard, so most likely no.

My point is: this exception (though I prefer never to use exceptions) allows you to notice code that is missing some elementary sanity checks, such as accessing a buffer outside its boundaries. The same design is used in at()-like functions and many others.
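As a rough illustration of the comparison with `at()`, the sketch below checks when `std::string::at` throws. The helper names `at_throws` and `check_at_bounds` are hypothetical. One difference worth keeping in mind: `at(i)` throws for `i >= size()`, whereas `substr(pos)` throws only for `pos > size()`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if s.at(i) throws std::out_of_range.
bool at_throws(const std::string& s, std::size_t i) {
    try {
        (void)s.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Compares the checked access of at() with the start check of substr().
void check_at_bounds() {
    const std::string s = "abc";   // size() == 3
    assert(!at_throws(s, 2));      // last valid index
    assert(at_throws(s, 3));       // i == size() already throws for at()
    assert(s.substr(3).empty());   // but pos == size() is still fine for substr()
}
```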

Ternvein
  • Thanks for the answer. I find it debatable whether starting beyond the end is a more basic sanity check than ending beyond the end, but as you say, this simply comes from having gotten used to strncmp and friends (including `std::string::substr`). My favorite example `size_t pos = s.find(' '); string cmd = s.substr(0, pos); string args = s.substr(pos+1)` is a reasonable use case where I would prefer an empty string over an exception, without having missed an "elementary" sanity check (no arg is okay). Using the buffer size for `len` is convenient, here the same would apply to `pos`. – tglas Jul 13 '16 at 20:35
  • Well, you can always write some class around `std::string` and make it work like you want it to. – Ternvein Jul 13 '16 at 20:52
  • Actually that's what I do (a simple free function), but it annoys me :) – tglas Jul 13 '16 at 21:01
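A free function of the kind mentioned in the comments might look like the sketch below. This is not tglas's actual code: the names `lenient_substr` and `split_command` are invented for illustration. The sketch also guards the case where `find` returns `npos`, because `npos + 1` wraps around to `0`, so `s.substr(pos + 1)` would then return the whole string rather than an empty one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: like substr, but a start position at or past the end
// yields an empty string instead of throwing.
std::string lenient_substr(const std::string& s, std::size_t pos,
                           std::size_t len = std::string::npos) {
    if (pos >= s.size()) return std::string();
    return s.substr(pos, len);
}

// Splits "cmd args" at the first space; a missing space means no args.
std::pair<std::string, std::string> split_command(const std::string& s) {
    const std::size_t pos = s.find(' ');
    if (pos == std::string::npos) {
        // Guard needed: npos + 1 wraps to 0, which would select the whole string.
        return {s, std::string()};
    }
    return {s.substr(0, pos), lenient_substr(s, pos + 1)};
}

// Exercises the helper on the cases discussed in the comments.
void check_split_command() {
    assert(split_command("run fast").first == "run");
    assert(split_command("run fast").second == "fast");
    assert(split_command("run").second.empty());   // no space: no args
    assert(split_command("run ").second.empty());  // trailing space: empty args
}
```

The wrapper trades the early error signal for convenience, which is the trade-off debated above, so whether it is appropriate depends on how much the caller trusts `pos`.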