PHP has an internal data structure called a smart string (smart_str?), which stores both the length and the buffer size. That is, it allocates more memory than the string's length requires, to improve concatenation performance. Why isn't this data structure used for regular PHP strings? Wouldn't that lead to fewer memory allocations and better performance?

Olle Härstedt

1 Answer

Normal PHP strings (as of PHP 7) are represented by the zend_string type, which includes both the length of the string and its character data array. A zend_string is usually allocated to fit the character data precisely (alignment notwithstanding): it leaves no room to append additional characters.
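To make the "precisely allocated" point concrete, here is a minimal sketch of a length-prefixed string with an inline character array. The names `sketch_string`, `sketch_string_alloc_size` and `sketch_string_init` are hypothetical, the field layout is only an approximation of zend_string, and the 8-byte alignment is an assumption; plain `malloc` stands in for PHP's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for zend_string (not the real definition). */
typedef struct {
    uint32_t refcount;   /* reference count */
    uint32_t type_info;  /* type and flag bits */
    uint64_t hash;       /* cached hash value, 0 if not yet computed */
    size_t   len;        /* string length, excluding the trailing NUL */
    char     val[];      /* character data stored inline, right after the header */
} sketch_string;

/* Bytes needed for a string of `len` characters:
 * header + characters + trailing NUL, rounded up to 8 bytes (assumed alignment). */
size_t sketch_string_alloc_size(size_t len) {
    size_t raw = offsetof(sketch_string, val) + len + 1;
    return (raw + 7) & ~(size_t)7;
}

/* Allocates exactly what the string needs; there is no spare capacity,
 * so any append has to reallocate. */
sketch_string *sketch_string_init(const char *src, size_t len) {
    assert(src != NULL);
    sketch_string *s = malloc(sketch_string_alloc_size(len));
    if (s == NULL) {
        return NULL;
    }
    s->refcount = 1;
    s->type_info = 0;
    s->hash = 0;
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';
    return s;
}
```

With this layout, the only slack in an allocation is the few bytes of alignment padding, which is why appending to such a string generally means allocating a new, larger block.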

The smart_str structure includes a pointer to a zend_string and an allocation size. In this case, the zend_string is not precisely allocated. Instead, the allocation is deliberately made larger than necessary, so that additional characters can be appended without expensive reallocations.

The reallocation policy for smart_str is as follows: First, it will be allocated to have a total size of 256 bytes (minus the zend_string header, minus allocator overhead). If this size is exceeded it will be reallocated to 4096 bytes (minus overhead). After that, the size will increase in increments of 4096 bytes.
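The policy described above can be sketched as a small capacity function. This is an illustration under stated assumptions, not PHP's actual code: the names are hypothetical, and `SKETCH_OVERHEAD` is an assumed placeholder of 32 bytes for the zend_string header plus allocator overhead (the real value depends on the platform and PHP version).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OVERHEAD   ((size_t)32)    /* assumed: string header + allocator overhead */
#define SKETCH_START_SIZE ((size_t)256)   /* first total allocation size */
#define SKETCH_PAGE       ((size_t)4096)  /* later allocations are multiples of this */

/* Returns the usable character capacity of a buffer that must hold
 * at least `needed` characters, following the 256 / 4096 / +4096 steps. */
size_t sketch_capacity_for(size_t needed) {
    size_t total = needed + SKETCH_OVERHEAD;

    /* Small strings: one fixed 256-byte allocation. */
    if (total <= SKETCH_START_SIZE) {
        return SKETCH_START_SIZE - SKETCH_OVERHEAD;
    }

    /* Larger strings: round the total up to a whole number of 4096-byte pages. */
    size_t pages = (total + SKETCH_PAGE - 1) / SKETCH_PAGE;
    return pages * SKETCH_PAGE - SKETCH_OVERHEAD;
}
```

Under these assumptions, a request for 1 character yields a capacity of 224, a request for 225 characters jumps to 4064, and a request for 4065 characters yields 8160, so the number of reallocations grows with the number of 4096-byte steps rather than with the number of appends.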

Now, imagine that we replaced all strings with smart_str structures. This would mean that even a single-character string would have a minimum allocation size of 256 bytes. Given that most strings in use are small, this is an unacceptable overhead.

So essentially, this is a classic performance/memory tradeoff. We use a memory-compact representation by default and switch to a faster but less memory-efficient representation in the cases that benefit most from it, i.e. cases where large strings are constructed from small parts.

Andrea
NikiC
  • Sure, but you could still tune `smart_str` to better fit the needs of normal PHP string handling, right? For example, by starting with a small size and then doubling it each time concatenation happens. Especially since string buffers are impossible to implement in PHP (!), and since memory is more abundant than CPU cycles. – Olle Härstedt Oct 16 '15 at 14:16
  • @OlleHärstedt Yes, it's likely possible to find some reasonable allocation policy once you start storing capacity at all. I was answering about smart_str specifically here. One relatively safe thing to do is to integrate with the allocator and (for small allocs) choose the next largest bucket size that will be used anyway. With a bit of trickery, it would even be possible to introduce no additional memory overhead for storing the capacity (using pseudo-float encoding). This is what HHVM does ;) – NikiC Oct 16 '15 at 15:00
  • Hm, do you have a link to explain that trickery? Sounds interesting. – Olle Härstedt Oct 16 '15 at 15:25
  • @OlleHärstedt Sure, here's the cap code implementation: http://lxr.php.net/xref/OTHER_IMPLEMENT/hiphop-vm/hphp/runtime/base/cap-code.h#21 The reason why it would have no additional memory overhead for PHP is that we can safely shrink the hash cache to 32 bits, leaving 32 bits for an encoded capacity value. Here's HHVM's implementation for choosing the string capacity: http://lxr.php.net/xref/OTHER_IMPLEMENT/hiphop-vm/hphp/runtime/base/string-data.cpp#91 – NikiC Oct 16 '15 at 15:38