38

I have discovered a disturbing inconsistency between std::string and string literals in C++0x:

#include <iostream>
#include <string>

int main()
{
    int i = 0;
    for (auto e : "hello")
        ++i;
    std::cout << "Number of elements: " << i << '\n';

    i = 0;
    for (auto e : std::string("hello"))
        ++i;
    std::cout << "Number of elements: " << i << '\n';

    return 0;
}

The output is:

Number of elements: 6
Number of elements: 5

I understand the mechanics of why this is happening: the string literal is really an array of characters that includes the null character, and when the range-based for loop calls std::end() on the character array, it gets a pointer past the end of the array; since the null character is part of the array, it thus gets a pointer past the null character.
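The mechanics described above can be sketched roughly as follows, assuming a C++11 compiler. The helper names (`array_range_length`, `literal_range_length`, `literal_strlen`, `string_range_length`) are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <iterator>
#include <string>

// Hypothetical helper: the number of elements a range-based for loop visits
// for a built-in array, i.e. std::end(a) - std::begin(a).
template <typename T, std::size_t N>
std::size_t array_range_length(const T (&a)[N])
{
    return static_cast<std::size_t>(std::end(a) - std::begin(a));
}

// "hello" has type const char[6]: five letters plus the terminating '\0'.
static_assert(sizeof("hello") == 6, "the literal includes the null terminator");

// Counts the '\0', because it is an element of the array.
std::size_t literal_range_length() { return array_range_length("hello"); }

// Stops at the '\0', so the terminator is not counted.
std::size_t literal_strlen() { return std::strlen("hello"); }

// std::string reports its stored length, which excludes any terminator.
std::size_t string_range_length() { return std::string("hello").size(); }
```

Under these assumptions, `literal_range_length()` yields 6 while `literal_strlen()` and `string_range_length()` both yield 5.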

However, I think this is very undesirable: surely std::string and string literals should behave the same when it comes to properties as basic as their length?

Is there a way to resolve this inconsistency? For example, can std::begin() and std::end() be overloaded for character arrays so that the range they delimit does not include the terminating null character? If so, why was this not done?

EDIT: To justify my indignation a bit more to those who have said that I'm just suffering the consequences of using C-style strings which are a "legacy feature", consider code like the following:

template <typename Range>
void f(Range&& r)
{
    for (auto e : r)
    {
        ...
    }
}

Would you expect f("hello") and f(std::string("hello")) to do something different?

HighCommander4
  • 6
    Is this a real question? It reads more like a personal opinion about what the standard should be instead of what it is. – Gene Bushuyev Jul 18 '11 at 20:33
  • Based on some of the answers and comments, I'm now wondering if the people in charge of determining the features for future versions of C++ have considered adding new string literal syntax for `std::string` strings. I mean, Objective-C and C# both use `@""` to indicate a non-C-style string literal, and even in C and C++ you have the `L""` syntax to indicate wide-character string literals. (And it seems `L''` can be used to indicate literal `wchar`s?) – JAB Jul 18 '11 at 20:34
  • @JAB: and what is exactly so wrong with string literal that would warrant yet another built-in type? – Gene Bushuyev Jul 18 '11 at 20:39
  • 1
    @Gene: Why did C implement a boolean type when integer types served the purpose perfectly well? – JAB Jul 18 '11 at 20:46
  • 1
    @JAB: In C++0x you'll be able to *create* a new string literal syntax for `std::string` via user-defined literals. – HighCommander4 Jul 19 '11 at 04:00
  • @Gene: Critical discussion is what keeps the standard improving over time :) – HighCommander4 Jul 19 '11 at 04:21
  • @HighCommander4: Well that's neat. – JAB Jul 19 '11 at 14:02

6 Answers

29

If we overloaded std::begin() and std::end() for const char arrays to return one less than the size of the array, then the following code would output 4 instead of the expected 5:

#include <iostream>

int main()
{
    const char s[5] = {'h', 'e', 'l', 'l', 'o'};
    int i = 0;
    for (auto e : s)
        ++i;
    std::cout << "Number of elements: " << i << '\n';
}
Howard Hinnant
  • 3
    Perhaps there is a way to tell apart character arrays defined as a string literal from character arrays defined normally? We would only want to overload for the former. – HighCommander4 Jul 17 '11 at 23:44
  • 1
    I don't know of a way to do that in the library. You would have to make a language change, and that change would break code. Narrow string literals are defined to be an array of n const char, where n is the number of characters plus one for the terminating null. – Howard Hinnant Jul 18 '11 at 00:36
  • 1
    It doesn't *have* to break code... it can keep the type the same, and just have the compiler remember the origin of the character array and report it through an intrinsic like __is_string_literal(char_array). But a way to do it in the library would have been nice... – HighCommander4 Jul 18 '11 at 00:46
  • 9
    Any solution would need to address what to do with `const char s[6] = {'h', 'e', 'l', 'l', 'o', '\0'};`. I'm siding with Howard here, C++ programmers should know that `sizeof("Hello")==6` – MSalters Jul 18 '11 at 08:19
  • This char s[] array is first and foremost a C-style array. The standard is consistent in how `auto` iterators behave across those C-style arrays. – David Hammen Jul 19 '11 at 01:19
  • @MSalters: No one is talking about `sizeof` here. `sizeof` is a C thing, and of course we don't want to change it. We're talking about what `begin(s)` and `end(s)` return, which are C++0x things. [continued below] – HighCommander4 Jul 19 '11 at 04:13
  • My suggestion was that for character array defined using the `"..."` syntax, `end(s)` point not one-past-the-null, but **to** the null (which is one-past-the last actual character of the string), while for character arrays defined any other way, `end(s)` should point one-past-the last element of the array, as before (regardless of whether or not the array's last element happens to be a null character). [continued below] – HighCommander4 Jul 19 '11 at 04:16
  • So according to my suggestion, `const char s[6] = {'h', 'e', 'l', 'l', 'o', '\0'};` would have length 6, while `const char s[6] = "hello";` would have length 5 (where by "length" I mean `end(s) - begin(s)`). I see nothing wrong with that. It doesn't break anything, because `begin(s)` and `end(s)` are new things in C++0x, so there is nothing for it to break! – HighCommander4 Jul 19 '11 at 04:17
  • 2
    @HighCommander4: I used `sizeof("Hello")==6` as a quick way to write that in C as well as C++, string literals _are_ constant char arrays with length N+1, including a terminating \0. Compilers need not, and probably do not distinguish between the two, by the time they're doing argument overloading. That means you would force a major compiler redesign for a minor feature. – MSalters Jul 19 '11 at 07:51
  • @MSalters: I'm not convinced that remembering whether a character array originated as a quoted string literal or something else requires a "major compiler redesign"; but, to be fair, I'm not extensively familiar with compiler internals, so I could be wrong. In any case, I'm not saying this solution would be great... I was just pointing out there *is* a solution that wouldn't break code. – HighCommander4 Jul 19 '11 at 08:33
  • 2
    I just realized it's worse than that. One translation unit could define `char const s[6]="Hello";` and another could call `end(s)-begin(s)`. That means that the difference between string literals and string arrays would require an ABI change. Sorry, that's just not going to happen. – MSalters Jul 19 '11 at 08:43
  • @HighCommander4: `= {'h', 'e', 'l', 'l', 'o', '\0'}` and `= "hello"` are just two ways to initialize objects of the same type to the same value. Therefore, they are indistinguishable. – musiphil Mar 14 '13 at 23:06
22

However, I think this is very undesirable: surely std::string and string literals should behave the same when it comes to properties as basic as their length?

String literals by definition have a (hidden) null character at the end of the string. std::string objects do not need one: because a std::string stores its own length, that null character is superfluous. The standard section on the string library explicitly allows non-null-terminated strings.
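As a rough illustration of the point about length, here is a minimal sketch, assuming C++11. The helpers `count_elements` and `with_embedded_null` are hypothetical names. It shows that a std::string's range is determined by its stored size, so a '\0' inside the string is just another character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the elements a range-based for loop visits.
std::size_t count_elements(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s) {
        (void)c;  // only the number of iterations matters here
        ++n;
    }
    return n;
}

// A std::string built from an explicit (pointer, length) pair may hold '\0'
// anywhere; its length comes from the stored size, not from a terminator.
std::string with_embedded_null()
{
    return std::string("ab\0cd", 5);
}
```

Here `count_elements(with_embedded_null())` visits all 5 characters, including the embedded null.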

Edit
I don't think I've ever given a more controversial answer in the sense of a huge amount of upvotes and a huge amount of downvotes.

The auto iterator when applied to a C-style array iterates over each element of the array. The determination of the range is made at compile-time, not run time. This is ill-formed, for instance:

char * str;
for (auto c : str) {
   do_something_with (c);
}

Some people use arrays of type char to hold arbitrary data. Yes, it is an old-style C way of thinking, and perhaps they should have used a C++-style std::array, but the construct is quite valid and quite useful. Those people would be rather upset if their auto iterator over a char buffer[1024]; stopped at element 15 just because that element happens to have the same value as the null character. An auto iterator over a Type buffer[1024]; will run all the way to the end. What makes a char array so worthy of a completely different implementation?

Note that if you want the auto iterator over a character array to stop early, there is an easy mechanism to do that: add an if (c == '\0') break; statement to the body of your loop.
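The early-exit approach might look like the sketch below, assuming C++11. The function name `count_until_null` is hypothetical, and the comparison is against the null character `'\0'`.

```cpp
#include <cstddef>

// Hypothetical helper: counts the characters in a char array up to, but not
// including, the first '\0' (or the whole array if it contains none).
template <std::size_t N>
std::size_t count_until_null(const char (&buffer)[N])
{
    std::size_t n = 0;
    for (char c : buffer) {
        if (c == '\0')
            break;  // stop early at the terminator
        ++n;
    }
    return n;
}
```

With this helper, both the literal `"hello"` (a `const char[6]`) and an unterminated `const char s[5] = {'h', 'e', 'l', 'l', 'o'};` give a count of 5.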

Bottom line: There is no inconsistency here. The auto iterator over a char[] array is consistent with how the auto iterator works over any other C-style array.

David Hammen
19

That you get 6 in the first case is an abstraction leak that couldn't be avoided in C. std::string "fixes" that. For compatibility, the behaviour of C-style string literals does not change in C++.

For example, can std::begin() and std::end() be overloaded for character arrays so that the range they delimit does not include the terminating null character? If so, why was this not done?

Assuming access through a pointer (as opposed to char[N]), only by embedding a variable inside the string containing the number of characters, so that searching for the null terminator isn't required any more. Oops! That's std::string.

The way to "resolve the inconsistency" is not to use legacy features at all.

Lightness Races in Orbit
  • 6
    "not to use legacy features at all." Not using string literals seems like a hard task (and having to remember string literals are a "legacy" feature may be a hard task as well). – Suma Jul 18 '11 at 19:45
  • @Suma: Well, I'm talking about passing `char const*` or `char[N]` around. String literals themselves are of course still perfectly reasonable. Admittedly, it is string literals that the OP used in his question; I guess the `for (auto c : "literal")` _is_ a bit of a tricky one. Regardless, `std::string` _is_ the "fix" for the behaviour that the OP doesn't like. – Lightness Races in Orbit Jul 18 '11 at 21:01
6

According to N3290 6.5.4, if the range is an array, boundary values are initialized automatically without begin/end function dispatch.
So, how about preparing some wrapper like the following?

struct literal_t {
    char const *b, *e;
    literal_t( char const* b, char const* e ) : b( b ), e( e ) {}
    char const* begin() const { return b; }
    char const* end  () const { return e; }
};

template< std::size_t N >  // std::size_t requires <cstddef>
literal_t literal( char const (&a)[N] ) {
    return literal_t( a, a + N - 1 ); // N - 1 excludes the terminating '\0'
}

Then the following code will be valid:

for (auto e : literal("hello")) ...

If your compiler supports user-defined literals, they can abbreviate this further:

literal_t operator"" _l( char const* p, std::size_t l ) {
    return literal_t( p, p + l ); // l excludes '\0'
}

for (auto e : "hello"_l) ...

EDIT: The following has less overhead (though it cannot be combined with a user-defined literal).

template< std::size_t N >
char const (&literal( char const (&x)[ N ] ))[ N - 1 ] {
    // Reinterprets the array as one element shorter, dropping the '\0'.
    return (char const(&)[ N - 1 ]) x;
}

for (auto e : literal("hello")) ...
Ise Wisteria
  • I have an implementation for literal: std::string. Use the tools at hand. Everyone knows C strings have a terminating NULL. – emsr Jul 18 '11 at 23:36
  • Thank you for pointing that out. Though the above approach is brief when combined with a user-defined literal, it has some overhead and does not seem to have much advantage over `std::string`. I should've mentioned the obvious approach with an array. I edited the answer. – Ise Wisteria Jul 19 '11 at 10:29
4

If you wanted the length, you should use strlen() for the C string and .length() for the C++ string. You can't treat C strings and C++ strings identically--they have different behavior.

robert
  • 1
    The question is related to how the updated C++ standard (C++0x) defines `for (auto e: someexp) {}` and how that differs when the expression is a string-lit rather than a char-array or std::string -- hence it got nothing to do with `strlen` or the correct method of getting the length. – Soren Jul 18 '11 at 05:41
  • @Soren the original poster explicitly called out length as one of the reasons he thought this behavior was wrong. – robert Jul 18 '11 at 12:05
3

The inconsistency can be resolved using another tool in C++0x's toolbox: user-defined literals. Using an appropriately-defined user-defined literal:

std::string operator"" _s(const char* p, std::size_t n)
{
    return std::string(p, n);
}

We'll be able to write:

int i = 0;
for (auto e : "hello"_s)
    ++i;
std::cout << "Number of elements: " << i << '\n';

Which now outputs the expected number:

Number of elements: 5

With these new std::string literals, there is arguably no more reason to use C-style string literals, ever.
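For what it's worth, suffixes without a leading underscore are reserved for the standard library, and C++14 later added a standard `s` suffix for `std::string` in the `std::string_literals` namespace. A minimal sketch, assuming a C++14 (or later) compiler; the function name `count_string_literal_elements` is hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the elements visited when iterating over a
// std::string literal written with the standard `s` suffix (C++14 and later).
std::size_t count_string_literal_elements()
{
    using namespace std::string_literals;  // brings operator""s into scope
    std::size_t n = 0;
    for (auto e : "hello"s) {
        (void)e;  // only the number of iterations matters here
        ++n;
    }
    return n;
}
```

Under that assumption the loop visits 5 elements, matching the `std::string("hello")` case from the question.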

HighCommander4
  • 4
    Note: User-defined literals **must** start with an underscore. Also, another answer already suggested literals - why not accept that one? – Xeo Jan 12 '12 at 15:25