Understanding the bug
The unexpected behavior is caused by quirks in a deprecated implementation of std::string. Older versions of GCC implemented std::string using copy-on-write semantics. It's a clever idea, but it causes bugs like the one you're seeing. What that means is that GCC tried to define std::string so that copying a string shared the internal buffer, and the buffer only got copied when one of the strings sharing it was modified. For example:
std::string A = "Hello, world";
std::string B = A; // No copy occurs (yet)
A[3] = '*'; // Copy occurs now because A got modified.
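To make the idea concrete, here is a minimal sketch of copy-on-write. CowString and its member names are hypothetical, invented for illustration; this is not how GCC's library is actually written, and it ignores thread safety. It only shows the principle: copies share one buffer, and a write makes a private copy first.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical illustration of copy-on-write; not the real library code.
class CowString {
public:
    explicit CowString(const char* text)
        : buffer_(std::make_shared<std::string>(text)) {}

    // The implicit copy constructor copies only the shared_ptr,
    // so a copy refers to the same buffer as the original.

    // Read access never copies the buffer.
    char at(std::size_t i) const { return (*buffer_)[i]; }

    // Write access first makes a private copy if the buffer is shared.
    void set(std::size_t i, char c) {
        if (buffer_.use_count() > 1) {
            buffer_ = std::make_shared<std::string>(*buffer_);
        }
        (*buffer_)[i] = c;
    }

    // True when both objects currently refer to the same buffer.
    bool shares_buffer_with(const CowString& other) const {
        return buffer_ == other.buffer_;
    }

private:
    std::shared_ptr<std::string> buffer_;
};
```

With this sketch, a copy shares the buffer with its source until one of them calls set, at which point the writer gets its own buffer and the other object is left untouched.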
When you access the string through a const reference, however, no copy occurs, because the library assumes that the string will not be modified through that reference:
std::string A = "Hello, world";
std::string B = A;
std::string const& A_ref = A;
const_cast<char&>(A_ref[3]) = '*'; // No copy occurs (your bug)
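One rough way to observe whether copies share a buffer is to compare the data pointers of a string and its copy, reading both through const references so that the check itself cannot trigger a copy. The function name below is hypothetical, and the result depends entirely on which standard library you build against.

```cpp
#include <string>

// Hypothetical helper: returns true if a freshly copied string still
// points at the same character buffer as the string it was copied from.
bool copy_shares_buffer() {
    std::string a = "Hello, world";
    std::string b = a;

    // Go through const references so that only const member functions are
    // called; a copy-on-write string does not unshare on const access.
    const std::string& a_ref = a;
    const std::string& b_ref = b;
    return a_ref.data() == b_ref.data();
}
```

If this returns true, you are most likely using a copy-on-write std::string, and the const_cast trick above will modify both strings at once.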
As you've noticed, copy-on-write semantics tend to cause bugs. Because of this, and because copying a string is pretty cheap (all things considered), the copy-on-write implementation of std::string was deprecated and replaced as the default in GCC 5.
So why are you seeing this bug if you're using GCC 5? It's likely that you're compiling and linking against an older version of the C++ standard library (one where copy-on-write is still the implementation of std::string). This is what's causing the bug for you.
Check which version of the C++ standard library you're compiling against, and if possible, update your compiler.
How can I tell which implementation of std::string my compiler is using?
- New GCC implementation: sizeof(std::string) == 32 (when compiling for 64 bit)
- Old GCC implementation: sizeof(std::string) == 8 (when compiling for 64 bit)
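The size check above can be sketched as a pair of small helpers. The function names are hypothetical. The _GLIBCXX_USE_CXX11_ABI macro is defined by GCC's standard library (libstdc++) from GCC 5 onward once one of its headers has been included: 1 selects the new std::string and 0 selects the old one. Other standard libraries do not define it, so the second helper falls back to -1.

```cpp
#include <string>

// Hypothetical helper: true when std::string is exactly one pointer wide,
// which matches the old GCC layout.
constexpr bool string_is_pointer_sized() {
    return sizeof(std::string) == sizeof(char*);
}

// Hypothetical helper: reports libstdc++'s ABI setting, or -1 when the
// macro is not defined (another standard library, or an older libstdc++).
constexpr int libstdcxx_abi_setting() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
    return _GLIBCXX_USE_CXX11_ABI;
#else
    return -1;
#endif
}
```

Printing both values from the translation unit that shows the bug is a quick way to confirm which implementation that particular build is really using.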
If your compiler is using the old implementation of std::string, then sizeof(std::string) is the same as sizeof(char*), because std::string is implemented as a pointer to a block of memory. The block of memory is the one that actually contains things like the size and capacity of the string.
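struct string { // Old data layout (simplified sketch)
    // Points at the characters; the size and capacity live just before them.
    size_t* _data;
    // SIZE_OFFSET and CAPACITY_OFFSET are placeholder constants, counted in size_t units.
    size_t size() const {
        return *(_data - SIZE_OFFSET);
    }
    size_t capacity() const {
        return *(_data - CAPACITY_OFFSET);
    }
    char const* data() const {
        return (char const*)_data;
    }
};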
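For readers who want something that compiles, the layout above can be imitated as follows. This is only a sketch under stated assumptions: Header and PointerOnlyString are hypothetical names, the header holds just a size and a capacity, and there is no reference counting, copying, or growth. It is meant to show why the handle itself is only one pointer wide.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical header stored immediately before the characters.
struct Header {
    std::size_t size;
    std::size_t capacity;
};

// Hypothetical handle that is one pointer wide, like the old layout.
class PointerOnlyString {
public:
    explicit PointerOnlyString(const char* text) {
        const std::size_t n = std::strlen(text);
        // One allocation holds the header, the characters, and the '\0'.
        void* block = std::malloc(sizeof(Header) + n + 1);
        if (block == nullptr) throw std::bad_alloc();
        Header* header = new (block) Header{n, n};
        chars_ = reinterpret_cast<char*>(header + 1);
        std::memcpy(chars_, text, n + 1);
    }
    ~PointerOnlyString() { std::free(header()); }

    // Copying is disabled to keep the sketch short.
    PointerOnlyString(const PointerOnlyString&) = delete;
    PointerOnlyString& operator=(const PointerOnlyString&) = delete;

    std::size_t size() const { return header()->size; }
    std::size_t capacity() const { return header()->capacity; }
    const char* data() const { return chars_; }

private:
    // Step back from the characters to the header in front of them.
    Header* header() const {
        return reinterpret_cast<Header*>(chars_) - 1;
    }

    char* chars_;
};
```

Every call to size() or capacity() has to follow the pointer into the heap block, which is the cost the newer layout avoids.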
On the other hand, if you're using the newer implementation of std::string, then sizeof(std::string) should be 32 bytes (on 64 bit systems). This is because the newer implementation stores the size and capacity of the string within the std::string itself, rather than in the data it points to:
struct string { // New data layout
    char* _data;
    size_t _size;
    size_t _capacity;
    size_t _padding;
    // ...
};
What's good about the new implementation?
The new implementation has a number of benefits:
- Accessing size and capacity can be done more quickly (since the optimizer is more likely to store them in the registers, or at the very least they're likely to be in the cache)
- Because std::string is 32 bytes, we can take advantage of Small String Optimization. Small String Optimization allows strings less than 16 characters long to be stored within the space normally taken up by _capacity and _padding. This avoids heap allocations, and is faster for most use cases.
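Small String Optimization can be observed by checking whether a string's character buffer lies inside the std::string object itself. The helper below is a hypothetical sketch of that check; the name stored_inline is not part of any library, and the exact length at which strings stop being stored inline depends on the implementation.

```cpp
#include <functional>
#include <string>

// Hypothetical helper: true when the characters of s are stored inside the
// std::string object itself rather than in a separate heap allocation.
bool stored_inline(const std::string& s) {
    const char* object_begin = reinterpret_cast<const char*>(&s);
    const char* object_end = object_begin + sizeof(std::string);

    // std::less gives a well-defined ordering even for unrelated pointers.
    std::less<const char*> before;
    return !before(s.data(), object_begin) && before(s.data(), object_end);
}
```

With the new GCC implementation you would expect this to hold for a short string such as "hi" and not for a string of a hundred characters. With the old implementation it should never hold, since the object contains nothing but a pointer.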
We can see below that GDB uses the old implementation of std::string, because sizeof(std::string) is 8 bytes:
