19

In C, I'm working on a "class" that manages a byte buffer, allowing arbitrary data to be appended to the end. I'm now looking into automatically resizing the underlying array, via calls to realloc, as it fills up. This should make sense to anyone who's ever used Java's or C#'s StringBuilder. I understand how to go about the resizing. But does anyone have any suggestions, with rationale provided, on how much to grow the buffer with each resize?

Obviously, there's a trade-off to be made between wasted space and excessive realloc calls (which could lead to excessive copying). I've seen some tutorials/articles that suggest doubling. That seems wasteful if the user manages to supply a good initial guess. Is it worth trying to round to some power of two or a multiple of the alignment size on a platform?

Does anyone know what Java or C# does under the hood?

Brian McFarland
  • 9,052
  • 6
  • 38
  • 56
  • 1
    IIRC, the .NET StringBuilder will at least double its current buffer size if you try to append something that will require a size increase. – Chris Farmer Apr 17 '12 at 18:39
  • The mathematical amount is 1.6180339887...: the golden ratio, *but just use 2* – pmg Apr 17 '12 at 18:46
  • 3
    @ChrisFarmer: That was the strategy in the past; the current version uses a different strategy. – Eric Lippert Apr 17 '12 at 18:50

7 Answers

41

In C# the strategy used to grow the internal buffer used by a StringBuilder has changed over time.

There are three basic strategies for solving this problem, and they have different performance characteristics.

The first basic strategy is:

  • Make an array of characters
  • When you run out of room, create a new array with k more characters, for some constant k.
  • Copy the old array to the new array, and orphan the old array.

This strategy has a number of problems, the most obvious of which is that it is O(n²) in time if the string being built is extremely large. Let's say that k is a thousand characters and the final string is a million characters. You end up reallocating the string at 1000, 2000, 3000, 4000, ... and therefore copying 1000 + 2000 + 3000 + 4000 + ... + 999000 characters, which sums to on the order of 500 million characters copied!

This strategy has the nice property that the amount of "wasted" memory is bounded by k.

In practice this strategy is seldom used because of that n-squared problem.
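For the original C question, strategy 1 might look like this rough sketch; the struct and function names are invented for illustration, not taken from any real library:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { K = 1000 };  /* fixed growth increment for this sketch */

typedef struct {
    unsigned char *data;
    size_t len;
    size_t cap;
} ByteBuf;

int bytebuf_append(ByteBuf *b, const void *src, size_t n)
{
    if (b->len + n > b->cap) {
        /* Grow by just enough multiples of K: wasted space stays below K,
           but a long run of appends triggers O(n^2) total copying. */
        size_t newcap = b->cap;
        while (b->len + n > newcap)
            newcap += K;
        unsigned char *p = realloc(b->data, newcap);
        if (!p)
            return -1;  /* on failure, leave the old buffer intact */
        b->data = p;
        b->cap = newcap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```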

The second basic strategy is

  • Make an array
  • When you run out of room, create a new array with k% more characters, for some constant k.
  • Copy the old array to the new array, and orphan the old array.

k% is usually 100%; if it is, this is called the "double when full" strategy.

This strategy has the nice property that its amortized cost is O(n). Suppose again the final string is a million characters and you start with a thousand. You make copies at 1000, 2000, 4000, 8000, ... and end up copying 1000 + 2000 + 4000 + 8000 + ... + 512000 characters, which sums to about a million characters copied; much better.

The strategy has the property that the amortized cost is linear no matter what percentage you choose.

This strategy has a couple of downsides: sometimes a single copy operation is extremely expensive, and you can be wasting up to k% of the final string length in unused memory.
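In C, a minimal sketch of double-when-full looks like this (names and the starting capacity of 16 are made up for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *data;
    size_t len;
    size_t cap;
} ByteBuf;

int bytebuf_append(ByteBuf *b, const void *src, size_t n)
{
    if (b->len + n > b->cap) {
        /* Double until the new data fits: amortized O(n) copying,
           but up to ~100% of the final length may sit unused. */
        size_t newcap = b->cap ? b->cap : 16;
        while (b->len + n > newcap)
            newcap *= 2;
        unsigned char *p = realloc(b->data, newcap);
        if (!p)
            return -1;  /* keep the old buffer on failure */
        b->data = p;
        b->cap = newcap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```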

The third strategy is to make a linked list of arrays, each array of size k. When you overflow an existing array, a new one is allocated and appended to the end of the list.

This strategy has the nice property that no operation is particularly expensive, the total wasted memory is bounded by k, and you don't need to be able to locate large blocks in the heap on a regular basis. It has the downside that finally turning the thing into a string can be expensive as the arrays in the linked list might have poor locality.
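The linked-list-of-blocks idea can be sketched in C like so; BLOCK is kept tiny here purely for illustration, and every name is invented:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCK = 8 };  /* block size; a real implementation would use far more */

typedef struct Block {
    struct Block *next;
    size_t used;
    unsigned char bytes[BLOCK];
} Block;

typedef struct {
    Block *head, *tail;
    size_t len;
} ChainBuf;

/* Append never copies old data; it just fills the tail block and
   allocates a fresh block when the tail is full. */
int chain_append(ChainBuf *c, const void *src, size_t n)
{
    const unsigned char *p = src;
    while (n > 0) {
        if (!c->tail || c->tail->used == BLOCK) {
            Block *blk = calloc(1, sizeof *blk);
            if (!blk)
                return -1;
            if (c->tail)
                c->tail->next = blk;
            else
                c->head = blk;
            c->tail = blk;
        }
        size_t room = BLOCK - c->tail->used;
        size_t take = n < room ? n : room;
        memcpy(c->tail->bytes + c->tail->used, p, take);
        c->tail->used += take;
        p += take;
        n -= take;
        c->len += take;
    }
    return 0;
}

/* The "expensive at the end" step: one O(n) pass to flatten the chain. */
void chain_copy_out(const ChainBuf *c, unsigned char *dst)
{
    for (const Block *blk = c->head; blk; blk = blk->next) {
        memcpy(dst, blk->bytes, blk->used);
        dst += blk->used;
    }
}
```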

The string builder in the .NET framework used to use a double-when-full strategy; it now uses a linked-list-of-blocks strategy.

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • Just to add Google Fodder, isn't this also called a Rope? http://is.gd/zsPpJT - or are ropes more sophisticated than simply linking arrays together? – Michael Stum Apr 17 '12 at 19:24
  • 3
    @MichaelStum: Ropes can be that simple, or can be a more generalized data structure for representing cheap concatenation of strings. I once spent a summer adding ropes to the internal string representation of the VBScript language and ultimately ended up abandoning the work; the added complexity of the rope class and its attendant overhead ended up costing more in typical scenarios than the savings in unlikely scenarios would justify. – Eric Lippert Apr 17 '12 at 19:37
  • 1
    @EricLippert, starting with which version does it use the linked list strategy? – Roman Royter Apr 17 '12 at 20:14
  • 3
    @RomanRoyter the linked list string builder strategy was introduced in .NET Framework 4. – phoog Apr 17 '12 at 20:59
  • I find the linked-list choice strange. I would have expected an array of arrays (`char[][]`) to allow fast random access, which `StringBuilder` requires. – CodesInChaos Apr 18 '12 at 06:59
  • @CodeInChaos, with just a `char[][]`, you're just adding another expanding buffer (which could be implemented with solution #2), or you always allocate a large array of `char[]` that contains `max_length/k` entries, where `k` is the length of each `char[]`. Also, this does nothing to simplify random write access, just random read access. For random write, you need to be able to insert a new array between two others and keep track of the length of each array OR do a ton of copying. – Brian McFarland Apr 18 '12 at 14:01
  • 2
    @EricLippert, thanks for providing a great analysis of each method as opposed to a "this is the right way to do it" answer. Particularly for explaining the O(n) vs O(n^2) aspect of method #1 vs method #2. Haven't decided yet if method #3 works in my use case. – Brian McFarland Apr 18 '12 at 14:58
  • 2
    @BrianMcFarland: You're welcome. And of course there are more sophisticated techniques you can use as well; you can use data structures like catenable deques, for example. But it is hard to beat the sheer speed of a big array; processors are optimized for contiguous data. – Eric Lippert Apr 18 '12 at 15:24
  • @BrianMcFarland It helps with write access, if you overwrite instead of inserting/deleting. An O(n) indexing is certainly surprising, even if the constant is likely very small. Fast insertions are a problem with a simple linked list too. You'd need to go with a full blown tree to solve it cleanly, and that would cause a relatively big constant overhead, and would also complicate the design. `char[][]` seems like a good trade-off to me. Constant indexer access(read and write) and cheap appending. Insertion/deletion has similar properties compared with a linked list. – CodesInChaos Apr 18 '12 at 15:31
  • 1
  • @CodeInChaos, I agree that a tree (provided it's self-balancing) would be superior to a list. But I think you're missing the point that an array is fixed size. In fact, in C, you cannot declare a `char[][]` and allocate the arrays containing data on the fly. You would need to do `char* data_buf[max_len/buf_size]` and then malloc the pointers stored in data_buf, or do `char **data_buf` and allow your array of pointers to be resized with `realloc` as well. In any language where an array size is mutable, it's actually an object like a C++ `std::vector`. – Brian McFarland Apr 18 '12 at 19:21
  • 2
    @EricLippert Do you know the reasoning behind the decision to switch the StringBuilder implementation in .NET? It seems like to me the most common cases of StringBuilder would be for use with relatively short strings, and a previous blog post of yours seems to confirm that sentiment. – MgSam Apr 19 '12 at 00:55
  • @MgSam one could speculate that the BCL people spotted a usage trend of being either very small, or very big and growing organically (no presizing). In the very big case Method 2 is very unpleasant in a non-compacting memory region (the LOH). Method 3 is vastly nicer there and not too bad in the small case, since the string likely fits in just one of the blocks; it gets a small constant-time overhead but keeps the benefits of contiguous memory. That's speculation of course. – ShuggyCoUk Apr 19 '12 at 08:41
8

You generally want to keep the growth factor a little smaller than the golden mean (~1.6). When it's smaller than the golden mean, the discarded segments will be large enough to satisfy a later request, as long as they're adjacent to each other. If your growth factor is larger than the golden mean, that can't happen.

I've found that reducing the factor to 1.5 still works quite nicely, and has the advantage of being easy to implement in integer math (size = (size + (size << 1))>>1; -- with a decent compiler you can write that as (size * 3)/2, and it should still compile to fast code).
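Spelled out as a C helper (the function name is made up for this sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Integer-only 1.5x growth factor, as described above:
   (size + (size << 1)) >> 1 is (3 * size) / 2, rounded down. */
static size_t grow_cap(size_t size)
{
    return (size + (size << 1)) >> 1;
}
```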

I seem to recall a conversation some years ago on Usenet in which P.J. Plauger (or maybe it was Pete Becker) of Dinkumware said they'd run rather more extensive tests than I ever did and reached the same conclusion (so, for example, the implementation of std::vector in their C++ standard library uses 1.5).

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • This is a very close second for being my accepted answer since it's a good explanation of what I think I'll end up using. But I like that Eric's answer contrasts each common approach. – Brian McFarland Apr 17 '12 at 19:20
2

When working with expanding and contracting buffers, the key property you want is to grow or shrink by a multiple of your size, not a constant difference.

Consider the case where you have a 16-byte array: increasing its size by 128 bytes is overkill. However, if instead you had a 4096-byte array and increased it by only 128 bytes each time, you would end up copying a lot.

I was taught to always double or halve arrays. If you really have no hint as to the size or maximum, multiplying by two ensures that you have a lot of capacity for a long time, and unless you're working on a resource constrained system, allocating at most twice the space isn't too terrible. Additionally, keeping things in powers of two can let you use bit shifts and other tricks and the underlying allocation is usually in powers of two.

Michael
  • 2,181
  • 14
  • 14
1

Does anyone know what Java or C# does under the hood?

Have a look at the following link to see how it's done in Java's StringBuilder from JDK11, in particular, the ensureCapacityInternal method. https://java-browser.yawk.at/java/11/java.base/java/lang/AbstractStringBuilder.java#java.lang.AbstractStringBuilder%23ensureCapacityInternal%28int%29

Some Guy
  • 405
  • 8
  • 15
Rich Drummond
  • 3,439
  • 1
  • 15
  • 16
0

It's implementation-specific, according to the documentation, but starts with 16:

The default capacity for this implementation is 16, and the default maximum capacity is Int32.MaxValue.

A StringBuilder object can allocate more memory to store characters when the value of an instance is enlarged, and the capacity is adjusted accordingly. For example, the Append, AppendFormat, EnsureCapacity, Insert, and Replace methods can enlarge the value of an instance.

The amount of memory allocated is implementation-specific, and an exception (either ArgumentOutOfRangeException or OutOfMemoryException) is thrown if the amount of memory required is greater than the maximum capacity.

Based on how some other .NET Framework classes behave, I would suggest multiplying the capacity by 1.1 each time the current capacity is reached. If extra space is needed, just have an equivalent to EnsureCapacity that will expand it to the necessary size manually.

Ry-
  • 218,210
  • 55
  • 464
  • 476
0

Translate this to C.

I would probably maintain a List<List<string>> list.

class StringBuilder
{
    private List<List<string>> list;

    public void Append(List<string> listOfCharsToAppend)
    {
        list.Add(listOfCharsToAppend);
    }
}

This way you are just maintaining a list of Lists and allocating memory on demand rather than allocating memory well ahead.

phoog
  • 42,068
  • 6
  • 79
  • 117
Sandeep
  • 7,156
  • 12
  • 45
  • 57
  • 2
    It also means that growth is linear instead of amortized constant, and that if each string being added is short (as is frequently the case), you waste a *lot* of space on pointers -- in the fairly common case of building a string one character at a time, on (say) a 64-bit system, you'd have 8 bytes of pointers to hold 1 byte of the string... – Jerry Coffin Apr 17 '12 at 18:59
0

List in the .NET Framework uses this algorithm: if an initial capacity is specified, it creates a buffer of that size; otherwise no buffer is allocated until the first item(s) are added, at which point it allocates space equal to the number of items added, but no less than 4. When more space is needed, it allocates a new buffer with 2x the previous capacity and copies all items from the old buffer to the new buffer. The earlier StringBuilder used a similar algorithm.

In .NET 4, StringBuilder allocates an initial buffer of the size specified in the constructor (the default size is 16 characters). When the allocated buffer is too small, no copying is done. Instead it fills the current buffer to the brim, then creates a new instance of StringBuilder, which allocates a buffer of size *MAX(length_of_remaining_data_to_add, MIN(length_of_all_previous_buffers, 8000))*, so at least all remaining data fits into the new buffer and the total size of all buffers is at least doubled. The new StringBuilder keeps a reference to the old StringBuilder, and so the individual instances form a linked list of buffers.
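That block-sizing rule can be written out as a small C snippet; the names and the 8000 constant come from the description above, not from any actual .NET source:

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }
static size_t max_sz(size_t a, size_t b) { return a > b ? a : b; }

/* New block is big enough for the remaining data, and otherwise matches
   the total size so far, capped at 8000 (so total size at least doubles
   until blocks hit the cap). */
static size_t next_block_size(size_t remaining, size_t total_so_far)
{
    return max_sz(remaining, min_sz(total_so_far, 8000));
}
```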

Ňuf
  • 6,027
  • 2
  • 23
  • 26