23

Is it a two-pass algorithm, i.e., does it iterate the enumerable once to count the elements so that it can allocate the array, and then iterate again to insert them?

Does it loop once, and keep resizing the array?

Or does it use an intermediate structure like a List (which probably internally resizes an array)?

mpen
  • My suggestion would be to download .NET Reflector and look at the source for yourself. – Justin Niessner Dec 02 '10 at 21:38
  • @Justin _or_ just using the Framework reference source from Microsoft. They tend to have comments and nicer variable names in them :-) It's what I used to research my answer. – driis Dec 03 '10 at 08:36

5 Answers

19

It uses an intermediate structure. The actual type involved is Buffer<T>, an internal struct in the framework. In practice, this type holds an array that is copied into a larger one each time it fills up. The array starts with a length of 4 (in .NET 4; this is an implementation detail that might change), so you might end up allocating and copying several times when calling ToArray.

There is an optimization in place, though. If the source implements ICollection<T>, it uses Count from that to allocate an array of the correct size from the start.
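One way to see the ICollection<T> shortcut for yourself is a small probe. This is only a sketch: `ProbeCollection` and `ProbeDemo` are hypothetical names made up for illustration, and the behaviour it observes is an implementation detail that could differ between framework versions.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical probe type: a read-only ICollection<int> that records how it is used.
public sealed class ProbeCollection : ICollection<int>
{
    private readonly int[] _data;
    public int CopyToCalls;
    public int EnumeratorCalls;

    public ProbeCollection(params int[] data) { _data = data; }

    public int Count { get { return _data.Length; } }
    public bool IsReadOnly { get { return true; } }

    public void CopyTo(int[] array, int arrayIndex)
    {
        CopyToCalls++;
        Array.Copy(_data, 0, array, arrayIndex, _data.Length);
    }

    public IEnumerator<int> GetEnumerator()
    {
        EnumeratorCalls++;
        return ((IEnumerable<int>)_data).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    public bool Contains(int item) { return Array.IndexOf(_data, item) >= 0; }

    // Mutation is not needed for the probe.
    public void Add(int item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Remove(int item) { throw new NotSupportedException(); }
}

public static class ProbeDemo
{
    public static void Run()
    {
        var probe = new ProbeCollection(1, 2, 3);
        int[] result = probe.ToArray(); // resolves to System.Linq.Enumerable.ToArray

        Console.WriteLine("CopyTo calls: " + probe.CopyToCalls);
        Console.WriteLine("GetEnumerator calls: " + probe.EnumeratorCalls);
        Console.WriteLine(string.Join(",", result));
    }
}
```

If the shortcut is taken, the probe should report that CopyTo was used and that the collection was never enumerated.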

driis
  • The worst-case total amount of temporary space allocated will be three times the resulting array size, and each item will be written at most once. The total space required could be reduced to twice the array size if overflowing an array meant allocating a twice-as-big array for new data (leaving old data where it is), but even 3x isn't huge. Note that while some items may be copied a dozen or more times, the average will never exceed 3x. – supercat Dec 02 '10 at 21:50
  • supercat: Note that the algorithm still uses O(n) space and time. – Gabe Dec 02 '10 at 22:09
  • why isn't it made unsafe, to make use of stack allocation, which would be a significant boost on temporary objects? – Andriy Shevchenko Oct 22 '17 at 16:30
  • @supercat how did you come up with those numbers? Can you provide an example? – David Klempfner Dec 07 '19 at 00:36
  • @Backwards_Dave: Each time the buffer fills up, half of the items will have been written once (after the last time it filled up) and half will be copied from the previous buffer. Half of the items in that previous buffer will have been written once, and half from an earlier buffer, etc. Just after a buffer fills up, half the items in the new buffer will have been copied, and one of them written once. – supercat Dec 07 '19 at 01:54
  • @supercat if you have a source IEnumerable with 0,1,2,3,4, and you call ToArray(), you'll have 3 arrays created (the initial one with length =4, the temp one that is doubled in size (Length = 8), then the last one that is the exact size needed ie. Length = 5). Total amount of temp space allocated = 8+4=12, but 3 * the resulting array size is 3*5=15. – David Klempfner Dec 07 '19 at 03:47
  • @supercat I wonder why it doesn't do this "allocating a twice-as-big array for new data (leaving old data where it is)". Wouldn't it have better performance, since no items would be copied, except once, to the destination array? To clarify, I'm thinking of a struct with an array and a pointer to the next struct with a (2x size) array (works like a singly linked list of arrays), then just iterating it forward copying those to the destination array. Wouldn't it be more efficient in terms of both space and time? – geekley Jun 01 '22 at 01:16
  • Ah I see now that in [more recent dotnet implementation](https://source.dot.net/#System.Linq/EnumerableHelpers.Linq.cs,3ce3c6387bacf2ca,references) it seems to do something kinda like what I was thinking (?), using this `LargeArrayBuilder` class. It has an array of buffers (each with 2x size of previous), and it copies from each of them into destination array. Not sure what version that is but that implementation is not [what's used in .NET Framework 4.8](https://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,ed118118b642d9d4,references). – geekley Jun 01 '22 at 01:56
  • @geekley: Doubling in size iteratively is a reasonable approach so long as the temporary collections stay small enough to avoid being placed in the Large Object Heap. Otherwise, I think a linked list of nodes, each of which holds a T[] and a link to the next node would probably be a good approach. I have no idea if .NET yet has a good way of determining the largest array size that would avoid Large Object Heap allocations. – supercat Jun 01 '22 at 16:02
10

First it checks to see if the source is an ICollection<T>, in which case it can read the source's Count, allocate an array of exactly that size, and fill it with the source's CopyTo() method.

Otherwise, it enumerates the source exactly once. As it enumerates it stores items into a buffer array. Whenever it hits the end of the buffer array it creates a new buffer of twice the size and copies in the old elements. Once the enumeration is finished it returns the buffer (if it's the exact right size) or copies the items from the buffer into an array of the exact right size.

Here's pseudo-source code for the operation:

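using System;
using System.Collections.Generic;

// Hypothetical wrapper class: extension methods must live in a static class.
public static class EnumerableSketch
{
    public static T[] ToArray<T>(this IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        T[] items = null;
        int count = 0;

        foreach (T item in source)
        {
            if (items == null)
            {
                // First element: start with a small buffer.
                items = new T[4];
            }
            else if (items.Length == count)
            {
                // Buffer is full: double it and copy the old elements across.
                T[] bigger = new T[count * 2];
                Array.Copy(items, 0, bigger, 0, count);
                items = bigger;
            }
            items[count] = item;
            count++;
        }

        // An empty source never allocates a buffer.
        if (items == null)
        {
            return new T[0];
        }
        // The buffer happens to be exactly full: return it as-is.
        if (items.Length == count)
        {
            return items;
        }
        // Otherwise trim to the exact size.
        T[] result = new T[count];
        Array.Copy(items, 0, result, 0, count);
        return result;
    }
}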
Gabe
6

Like this (via .NET Reflector):

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw Error.ArgumentNull("source");
    }
    Buffer<TSource> buffer = new Buffer<TSource>(source);
    return buffer.ToArray();
}

[StructLayout(LayoutKind.Sequential)]
internal struct Buffer<TElement>
{
    internal TElement[] items;
    internal int count;
    internal Buffer(IEnumerable<TElement> source)
    {
        TElement[] array = null;
        int length = 0;
        ICollection<TElement> is2 = source as ICollection<TElement>;
        if (is2 != null)
        {
            length = is2.Count;
            if (length > 0)
            {
                array = new TElement[length];
                is2.CopyTo(array, 0);
            }
        }
        else
        {
            foreach (TElement local in source)
            {
                if (array == null)
                {
                    array = new TElement[4];
                }
                else if (array.Length == length)
                {
                    TElement[] destinationArray = new TElement[length * 2];
                    Array.Copy(array, 0, destinationArray, 0, length);
                    array = destinationArray;
                }
                array[length] = local;
                length++;
            }
        }
        this.items = array;
        this.count = length;
    }

    internal TElement[] ToArray()
    {
        if (this.count == 0)
        {
            return new TElement[0];
        }
        if (this.items.Length == this.count)
        {
            return this.items;
        }
        TElement[] destinationArray = new TElement[this.count];
        Array.Copy(this.items, 0, destinationArray, 0, this.count);
        return destinationArray;
    }
}
StriplingWarrior
2

First, the items are loaded into an internal struct, Buffer<T>, which also keeps track of the count.

Next, Buffer<T>.ToArray is called, which returns the buffer's array directly if it is exactly full, or otherwise does an Array.Copy of it into a new array of the right size.

.NET Reflector shows this code if you want to see for yourself.

http://www.red-gate.com/products/reflector/

David Klempfner
Martin Peck
2

In general, attempting to iterate an enumerable twice can lead to disaster, as there is no guarantee that the enumerable can be iterated a second time. Therefore, performing a Count, then allocating, then copying is out.
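To make the two-pass problem concrete, here is a small sketch. `SinglePassDemo`, `Drain` and `CountThenCopy` are hypothetical names invented for this illustration; the source deliberately consumes a queue as it is enumerated, so a second pass finds nothing left.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SinglePassDemo
{
    // Hypothetical source: yields items by draining a queue,
    // so enumerating it a second time yields nothing.
    public static IEnumerable<int> Drain(Queue<int> queue)
    {
        while (queue.Count > 0)
        {
            yield return queue.Dequeue();
        }
    }

    // The naive two-pass approach: count first, then copy.
    public static int[] CountThenCopy(IEnumerable<int> source)
    {
        int n = source.Count();      // first pass empties the queue
        int[] result = new int[n];
        int i = 0;
        foreach (int x in source)    // second pass sees an empty queue
        {
            result[i++] = x;
        }
        return result;
    }
}
```

With a queue holding 1, 2, 3, `CountThenCopy(Drain(queue))` allocates three slots but never fills them, whereas a single-pass `Drain(queue).ToArray()` on a fresh queue captures all three values.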

Reflector shows that it uses a type called Buffer<T> that effectively streams the sequence into an array, resizing as needed (doubling on each reallocation so that the number of reallocations is O(log n)), and then returns an appropriately sized array when it reaches the end.
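The cost of the doubling strategy can be estimated with a small simulation. This is a sketch under the assumptions stated in this thread (initial capacity of 4, doubling when full, a final trim copy when the buffer is not exactly full, and a source that is not an ICollection<T>); `GrowthSketch` and `Simulate` are hypothetical names, not framework APIs.

```csharp
using System;

public static class GrowthSketch
{
    // Returns how many arrays get allocated, and how many element slots in total,
    // when n items are streamed through a buffer that starts at 4 and doubles.
    public static (int Arrays, long Elements) Simulate(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        if (n == 0) return (1, 0);   // only the empty result array

        long capacity = 4;           // long, so doubling cannot overflow
        int arrays = 1;
        long elements = capacity;

        while (capacity < n)         // each pass models one reallocation
        {
            capacity *= 2;
            arrays++;
            elements += capacity;
        }

        if (capacity != n)           // final trim to the exact size
        {
            arrays++;
            elements += n;
        }
        return (arrays, elements);
    }
}
```

For 5 items this gives 3 arrays totalling 17 slots (4 + 8 + 5), and for 1000 items it gives 10 arrays totalling 3044 slots, so the array count grows logarithmically while the total space stays proportional to n.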

jason