6

Consider the following code:

std::vector<std::string> foo{{"blee"}, {"bleck"}, {"blah0000000000000000000000000000000000000000000000000000000000000000000000000000000000"}};
std::string *temp = foo.data();
char*** bar = reinterpret_cast<char***>(&temp);

for (size_t i = 0; i < foo.size(); ++i){
    std::cout << (*bar)[i] << std::endl;
}

Clearly this is sketchy code, but it happens to work.

http://ideone.com/2XAJYR

I would like to know why it works. Are there some strange rules of C++ I don't know about? Or is it just bad code and undefined behaviour?

I made one of the strings huge in case there was some small-string optimization going on.

Adapted from: Cast a vector of std::string to char***

Neil Kirk
  • 7
    it's possible that the string stores the pointer to its buffer as the first structure member, hence its address is the same as that of the string object itself. I wouldn't say "it works"; it's rather *pretending* to work. – The Paramagnetic Croissant Mar 13 '15 at 12:56
  • 1
    @TheParamagneticCroissant yes I bet on it too, but I don't think it's a good idea to rely on that creating code that should always work... – W.F. Mar 13 '15 at 12:57
  • I think it has to do with std::vector behaviour. It **guarantees** contiguous memory and even copies and moves its data around to fulfil this guarantee. If you put std::strings of fixed size in there, they are placed in a contiguous block of memory and you can do tricks like that :). To check it, you can push_back another string into your vector and see if your previous data pointers are still valid. – Amadeusz Mar 13 '15 at 13:13

4 Answers

7

It is very much undefined behaviour.

It will appear to "work" if the string implementation happens to contain a pointer to the string data as its only data member, so that an array of string has the same memory layout as an array of char*. That is the case for at least one popular implementation (GNU), but is certainly not something you can rely on.

Mike Seymour
  • 1
    this is a particularly nasty one when you're dealing with C libraries. I've seen some cases where a `std::string` is passed as `void *`, which works great until someone decides to recompile with `clang`. – Shep Mar 13 '15 at 14:30
  • @Shep: If you're dealing with C libraries, then use `c_str()` to get a C-compatible pointer in a well-defined manner. Dodgy type-punning is nasty in any situation. – Mike Seymour Mar 13 '15 at 14:33
  • @Shep except, that might be ok! There's a paradigm known as "opaque pointers" or "opaque handles" that lets you specify e.g. user-data to be passed to callback functions. The library will not interpret the `void*` in any way in these situations. The user-callback _knows_ what the type actually is and can cast it back. (PS. Of course that's passing `&string` instead of the string itself as `void*`) – sehe Mar 13 '15 at 14:35
  • What I don't understand is why `(*bar)[i]` knows to increment the size of the string object to the next, and doesn't just go by one pointer. – Neil Kirk Mar 13 '15 at 14:42
  • 1
    @NeilKirk: As I said, it "works" if a `char*` pointer is the **only** data member, so that an array of `string` has the same layout as an array of `char*`. It will explode if it contains other data members. – Mike Seymour Mar 13 '15 at 14:44
  • I thought string had to contain size and capacity info also. Maybe it got optimized out. – Neil Kirk Mar 13 '15 at 14:53
  • @NeilKirk: In the case of the GNU implementation, that's stored just before the string data in memory, accessible via the pointer in the string object. – Mike Seymour Mar 13 '15 at 14:55
  • 1
    Ooooooooo now it makes sense. I assumed that size and capacity would be member variables of the string object itself. – Neil Kirk Mar 13 '15 at 15:04
  • @MikeSeymour When you say GNU implementation are you talking gcc? Cause on gcc 4.9.2 I'm seeing just the strings, not any gibberish before them: http://ideone.com/w6B55W – Jonathan Mee Mar 13 '15 at 15:04
  • 1
    @JonathanMee: The pointer points to the string data, so printing that just prints the string. There's some more information stored in the memory before the string, which the implementation can access by subtracting some offset from the pointer. – Mike Seymour Mar 13 '15 at 15:06
  • @MikeSeymour So that's what you were talking about here: http://stackoverflow.com/questions/27687751/cast-a-vector-of-stdstring-to-char/29031944?noredirect=1#comment46307814_29031944 – Jonathan Mee Mar 13 '15 at 15:09
  • @JonathanMee: Yes, I gave a sketch answer to the same question there. – Mike Seymour Mar 13 '15 at 15:12
  • @MikeSeymour Welp, I learned a lot today. I love this site for that, but man if the down voting doesn't get pretty brutal. – Jonathan Mee Mar 13 '15 at 15:27
2

The behaviour depends on your standard library implementation (just look at its std::vector and std::string source code). Sometimes the string implementation stores, as other participants mentioned, a pointer to the character buffer as a member.

It's no secret that one shouldn't rely on encapsulated implementation details, because doing so causes undefined behaviour.

2

After Neil Kirk mentioned this in a comment on the answer that originally sparked all this, I looked it up.

string is a specialization of basic_string on all implementations.

Now I only have access to Visual Studio 2013's version of xstring.h (where Microsoft implements basic_string), so this may be different for other versions or compilers. But in xstring.h, basic_string inherits from _String_alloc, which inherits from _String_val.

_String_val is actually the first class in the inheritance chain that has any member variables. Its first member variable, _Bx, is a union which will translate to a char* for string (not for wstring).

So when a string is cast to a char* on Visual Studio 2013, the result is a char* that points at the member variable _Bx. Since _Bx is actually a '\0'-terminated char*, you can cout it and it behaves properly.

Now what I didn't know, and what all this research taught me, is that _String_val also contains a size variable, _Mysize, and a reserved size, _Myres. If either of those had been declared in _String_val before _Bx, this would have printed gibberish at the start of each line of cout's output.

I'd conclude by conceding that, as mentioned in the other answers, this behavior is implementation dependent and may not work across different versions or platforms.

Jonathan Mee
0

As The Paramagnetic Croissant's comment suggests, for this to work the pointer to the character buffer must be the first member of the string class, so that the address of the string object is also the address of that pointer.

I couldn't find any explicit mention of this in the standard. It is possible that some other rule in the standard implicitly imposes this one, but I didn't find one.

Therefore you shouldn't rely on it.

Note: another, more obvious, requirement is that std::vector provides contiguous storage, but that one is specified.

Félix Cantournet