10

I recently passed a null pointer to a std::string constructor and got undefined behavior. I'm certain this is something that thousands or tens of thousands of programmers have done before me, and this same bug has no doubt crashed untold numbers of programs. It comes up a lot when converting from code using char* to code using std::string, and it's the kind of thing that is not catchable at compile time and can easily be missed in run-time unit tests.

What I'm confused about is the reason for specifying std::string this way.

Why not just define std::string(NULL)==""?

The efficiency loss would be negligible, I doubt it's even measurable in a real program.

Does anyone know what the possible reason for making std::string(NULL) undefined is?

kdog
  • 6
    Ask the C++ committee what their reasoning is behind that omission. – Remy Lebeau Mar 01 '18 at 20:26
  • 4
    This isn't really a rationale, but passing `NULL` to almost any C string function is undefined. `std::strlen`, `std::strcpy`, `std::strchr`, etc... so if you made a special case for the `std::string` constructor, it would be *one* special case that's different from all the others. – Dietrich Epp Mar 01 '18 at 20:26
  • 3
    Accepting a nullptr would be masking a programming error. A nullptr is not a string. Treating one as such is an error. – juanchopanza Mar 01 '18 at 20:28
  • 3
    @kdog Yes, there is `std::strlen`. – juanchopanza Mar 01 '18 at 20:30
  • @kdog: `std::string` is not an object, it is a class. `strlen` is a member of `namespace std`; it is in the `<cstring>` header. – Dietrich Epp Mar 01 '18 at 20:33
  • @kdog Yes, it totally is a programming error. Now think about this: what must a `char*` point to to be considered a string? – juanchopanza Mar 01 '18 at 20:33
  • "" != NULL. One is unset. One is an empty string. They are quite often not interchangeable (at even a 'business logic' level). And this is not hard: `string st(szNull ? szNull : "");` – zzxyz Mar 01 '18 at 20:33
  • 1
    @kdog http://en.cppreference.com/w/cpp/string/byte/strlen – Killzone Kid Mar 01 '18 at 20:36
  • @KillzoneKid @DietrichEpp Correct - `std::strlen` is in `std::`. – kdog Mar 01 '18 at 20:40
  • 1
    @zzxyz But how is it better to force the user to say `std::string st(szNull ? szNull : "")` and get undefined behavior otherwise than to just have the constructor do it? There is no situation where defining the undefined behavior hurts the user, and many where it helps, so why not define it? – kdog Mar 01 '18 at 20:43
  • @kdog Because there are many that don't need the overhead because their code does not pass nullptrs to std::string constructors. People who need a nullptr check can write a function that does that and returns a string. – juanchopanza Mar 01 '18 at 20:48
  • @kdog: This can be dangerous, because if passing a null string is safe in certain places but unsafe in others, it makes the interface inconsistent. Having it be consistent makes it easier to use and makes the system safer overall. – Dietrich Epp Mar 01 '18 at 20:48
  • You could define a new namespace for yourself and define a new string in there that inherits and does the check in its constructor. It doesn't even matter if it gets sliced when passed to functions because you don't care what happens to it after construction. – Zan Lynx Mar 01 '18 at 20:57
  • @DietrichEpp On the other hand, consistency can be maintained without UB, for example by throwing an exception, as some implementations do by default. – juanchopanza Mar 01 '18 at 21:16
  • This is why I don't use the STL at all and wrote my own container and string types. Standard = Mediocre by definition. – Pablo Ariel Mar 05 '18 at 13:34

2 Answers

9

No good reason as far as I know.

Someone just proposed a change to this a month ago. I encourage you to support it.

std::string is not the best example of well-done standardization. The version initially standardized was impossible to implement; the requirements placed on it were not consistent with each other.

At some point that inconsistency was fixed.

In C++11 the rules were changed to prevent COW (copy-on-write) implementations, which broke the ABI of existing, reasonably compliant std::strings. That change may have been the point where the inconsistency was fixed; I do not recall.

Its API differs from the rest of std's containers because it didn't come from the same pre-std STL.

Treating this legacy behavior of std::string as some kind of reasoned decision that takes into account performance costs is not realistic. If any such testing was done, it was 20+ years ago on a non-standard compliant std::string (because none could exist, the standard was inconsistent).

It continues to be UB to pass (char const*)0 or nullptr due to inertia, and it will remain UB until someone makes a proposal and demonstrates that the cost is tiny while the benefit is not.

Constructing a std::string from a literal char const[N] is already a low-performance solution: you already have the size of the string at compile time, yet you drop it on the ground and then, at runtime, walk the buffer to find the '\0' character (unless this is optimized away; and if it can be, the null check is equally optimizable). The high-performance solution involves knowing the length and telling std::string about it, instead of copying from a '\0'-terminated buffer.

Yakk - Adam Nevraumont
  • 1
    Are there archives where the standards committee actually measured the difference in time in the constructor of adding the check? How can a reasonably knowledgeable programmer believe that check makes a measurable time difference to a `std::string` constructor, given how much else that constructor has to do? – kdog Mar 01 '18 at 20:49
  • @kdog Isn't that a perfect example of "don't pay for what you don't use"? It could be done. In a lot of cases the difference would be negligible, but in other cases it would not be. – super Mar 01 '18 at 20:53
  • @kdog Doing something takes more time than not doing it. In C++, in general, you don't force people to do things they don't need. That is why it can be used for performance-critical applications, where tiny latency differences can make a huge difference. You don't want to force all the programs to pay a penalty, no matter how tiny, to cover the cases where people pass nullptrs to places where null-terminated strings are expected. – juanchopanza Mar 01 '18 at 20:53
  • 1
    @kdog I strongly suspect the reason is *nobody has written a proposal to change this*. `std::string` is as old as C++ if not older, and is full of design mistakes. It didn't have the same robust use that `std::vector` did. Assuming that `std::string` does something for a *good reason* is questionable. It behaves the way it does because that is how it was standardized 20 years ago; as it happens, the version standardized was *impossible to implement*. It will continue to be UB to pass it `nullptr` and `(char const*)0` unless someone makes a proposal to change it, however. – Yakk - Adam Nevraumont Mar 01 '18 at 21:12
  • @Yakk "std::string is as old as C++ if not older," - the std namespace and std::string were both introduced in C++98, but C++ had been around for a long time without either before that. –  Mar 01 '18 at 21:26
  • 1
    @Yakk Interesting information and great answer to a tough question. I hadn't realized (based on the comments) that naivete about optimization and run-time costs was so prevalent! I can't believe so many commenters think there's a run-time cost to that here - is the committee reasonably knowledgeable about actual run-time costs? Or would they seriously believe that such a change would make a measurable time impact like several posters here? – kdog Mar 01 '18 at 21:26
  • 2
    @kdog: Measure it and post numbers. An extra check will always cost you performance. It is more instructions in your program, more bytes used in your code cache, more instructions which could cause your loop to use an extra cache line, one more entry in your branch prediction unit. No, I am wrong: Millions of bytes in your cache and millions of entries in your branch prediction unit, because this code gets inlined everywhere! :-) Yes, did you know inlining code can also cost performance? It does in a lot of cases. – Johannes Overmann Mar 01 '18 at 22:12
  • 1
    @t.c. awesome, a month ago. – Yakk - Adam Nevraumont Mar 01 '18 at 22:39
  • @JohannesOvermann If your program gets slower due to a check against null, it's because your architecture is plain wrong. Why do you construct so many strings to the point that it makes a difference? – Pablo Ariel Mar 02 '18 at 19:16
  • 1
    @pablarie Johannes is wrong, but not for that reason. He's wrong because it's impossible for a real program to be affected by the check against null, which takes no time (it's done while the value of the pointer is being fetched). It's not that his architecture is wrong, it's that his architecture doesn't exist: you can't write a real-world program that would show a timing difference based on that null check, unless you somehow crafted it intentionally and maliciously for one architecture and one compiler that did nothing else. – kdog Mar 02 '18 at 22:23
  • 1
    @kdog No, the null check causes a branch, and branch mispredictions can have a high cost. There *is* a cost here. It is very small *if* compiled right and you never pass null. It coud be modest if you unpredictably and at the last nanosecond made it null or not. – Yakk - Adam Nevraumont Mar 03 '18 at 00:32
  • 1
    @Yakk Multiple people here claim the cost of adding the null check would be noticeable in real programs. One person absurdly said a few minutes per day; but they're all out of touch with reality. They are talking about the cost overhead for a non-null argument std::string constructor of implementing the null check. So to support their argument, you'd have to give std::string a workload of non-null parameters to the constructor, since the old version wouldn't have worked at all with null parameters. So you would never actually get a branch misprediction. You can't benchmark an effect this low. – kdog Mar 03 '18 at 01:16
3

The sole reason is: Runtime performance.

It would indeed be easy to define that std::string(NULL) results in the empty string. But it would cost an extra check at the construction of every std::string from a const char *, which can add up.

On the balance between absolute maximum performance and convenience, C++ always goes for absolute maximum performance, even if that means compromising the robustness of programs.

The most famous example is to not initialize POD member variables in classes by default: Even though in 99% of all cases programmers want all POD member variables to be initialized, C++ decides not to do so to allow the 1% of all classes to achieve slightly higher runtime performance. This pattern repeats itself all over the place in C++. Performance over everything else.

There is no "the performance impact would be negligible" in C++. :-)

(Note that I personally do not like this behavior in C++ either. I would have made it so that the default behavior is safe, and that the unchecked and uninitialized behavior has to be requested explicitly, for example with an extra keyword. Uninitialized variables are still a major problem in a lot of programs in 2018.)

Johannes Overmann
  • 1
    No, it's not comparable at all. There are some classes where it makes a difference not to have member variables initialized, like a Complex class, say, or a class wrapper around a bit mask. But `string` already has to do a bunch of checks, and it is often allocating heap memory. I mean, it has to scan the whole length of the passed `char*` anyway just to start. An extra check wouldn't make any difference in `string` at all; it's a single register-against-zero check. – kdog Mar 01 '18 at 20:45
  • 4
    @kdog: I disagree: You need an extra check for 0. How can an extra check not make a performance difference? It will. And yes, it is comparable: You can manually initialize all member variables in a class and you can manually check for 0 in the std::string constructor. There is exactly the same reasoning behind these: Do not waste CPU cycles on stuff which is sometimes unnecessary. – Johannes Overmann Mar 01 '18 at 20:50
  • 1
    In a situation like `class Mask {int m;}` the overhead of default-initializing the member variable would be a significant portion of the overhead of the constructor. In `std::string(char*)` the overhead would not be a significant portion because of all the other computation that `std::string` needs to do, like checking the length of the passed string, comparing that length to its buffer size, and copying all the elements. I don't believe you could measure the difference in a real application. With the `class Mask` example, you could measure the difference. – kdog Mar 01 '18 at 20:57
  • 3
    @kdog Please understand that it is important to be able to write programs that save a seemingly insignificant amount of time. A few seconds over many hours can translate into millions of bucks in the right context. And in all my years programming I have not once encountered a situation where a nullptr got passed to a string, mainly because there's a spec that says "don't do that". – juanchopanza Mar 01 '18 at 21:04
  • 1
    @juanchopanza there is 0 chance in any real program that adding this change would make a 1-second difference running over 24 hours. I can't understand why you (and several other posters here) don't see this. I don't even think you could benchmark the difference; it's a single register access that is part of a much more expensive, branching function. It's unmeasurable in a benchmark, and in any program where anything was actually done to any of the strings, completely unmeasurable. – kdog Mar 01 '18 at 21:28
  • 1
    @kdog You seem to have this misplaced notion that small things are not measurable. I am not sure where you're getting that from, but benchmarking exactly this kind of thing is what good library implementers spend a lot of time doing. – juanchopanza Mar 01 '18 at 21:36
  • 2
    @juanchopanza Memory hierarchy costs would swamp a zero check. The string constructor has to follow the char*, which is most of the work. It has to effectively do a `strlen` on it, accessing each character checking for null. It has to compute the length and compare that length to its internal buffer size if the short string optimization is done. If so, it has to copy each char to its internal buffer; otherwise it has to allocate new heap memory and then do the copy. This all requires lots of memory accesses. You can't measure a single register access against that. It would be lost in the noise. – kdog Mar 01 '18 at 21:45
  • 1
    @kdog You run enough experiments to make the noise smaller than the size of the effect that you're trying to measure. – juanchopanza Mar 01 '18 at 21:46
  • 2
    Imagine a program that sits there all day scanning Internet traffic looking for keywords that will lead it to the terrorist of the week. This will be parsing web pages into balls of `std::string` to the tune of trillions of `string`s per hour. One extra test could add minutes to the runtime by the end of the day. – user4581301 Mar 01 '18 at 21:46
  • "How can an extra check not make a performance difference?" - As I understand it, on modern CPUs the extra check and the normal behaviour are pipelined; if the check succeeds there is literally no performance hit, and there is only bad performance when the check fails and the pipelined results are abandoned – M.M Mar 01 '18 at 21:59
  • 2
    @M.M: Do not forget about cache pressure and branch prediction entries: Both will cost you performance even if the actual check does not cost any time. It costs space. And std::string is inlined a lot. – Johannes Overmann Mar 01 '18 at 22:16
  • 1
    @user4581301 You think that adding a null check to a program allocating std::string, which has to access memory multiple times, do multiple checks, and often access the heap, could add "minutes" in a day? Where do all you folks come from? There are like 4 very vocal people on this thread who have very strong opinions about optimization but clearly haven't done it all that much. Honestly, a lot of the C++ threads on stackoverflow are filled with people with strong and wrong optimization opinions. – kdog Mar 02 '18 at 22:28
  • 2
    @kdog You seem you have an answer in mind that you want to hear, so why don't you post it already, with some stats to back it up? – juanchopanza Mar 02 '18 at 22:31
  • 1
    @kdog Depends. My argument is that there is an infinitesimally small cost that will require trillions of executions to be noticeable. I used that example because 15 years ago I wrote a program that did almost exactly that. I wrote it in C, but the point is the same. When you are doing things trillions of times, the stupid little things that you normally ignore (I happily eat the zero initializing of a `vector` most of the time) start adding up. Virtually everyone will not care. The rest are probably not using `string` anyway. – user4581301 Mar 03 '18 at 00:04
  • @user4581301 A C program written 15 years ago has nothing to do with the overhead of a std::string constructor. There is no way, as I have explained, that in a real program you would see a delay of "minutes" over a day due to the null check in a string constructor. – kdog Mar 03 '18 at 00:07
  • 1
    @user4581301 More seriously, it's a real flaw in stackoverflow that people pontificate about things they don't know anything about. If you believe, as one user says, that adding a null check to the std::string constructor could result in a minutes-long delay in a day-long computation in a real program, you should not be answering questions on stackoverflow - you should be asking them. And if you believe there is even a few seconds of delay in a real program, same. No one here is expert in everything, but if you don't know anything about compilers DON'T SPECULATE. – kdog Mar 03 '18 at 00:10