2

I am looping through a potentially huge (millions of items) dataset (stored on disk) and pulling out selected items which I am adding to a List<T>. When I add an item to the list, I put a lock around it, as there are other threads accessing the list.

I am trying to decide between two possible implementations:

1) Lock the list every time I need to add an item.

2) Use a temporary list that I add items to as I find them, and then use List<T>.AddRange() to add the items from that list in a chunk (e.g. when I have found 1000 matches). This means requesting a lock on the list less often, but if AddRange() only increases the capacity by exactly enough to accommodate the new items, the list will end up being resized a lot more times.
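To illustrate option 2, here is a rough sketch (the names, types and the batch size of 1000 are just placeholders, not my actual code):

using System;
using System.Collections.Generic;

static class BatchedAddSketch
{
    // Shared list of matches and the lock that guards it (placeholder names).
    static readonly List<int> matches = new List<int>();
    static readonly object listLock = new object();

    static void CollectMatches(IEnumerable<int> source, Func<int, bool> isMatch)
    {
        var buffer = new List<int>(1000);    // temporary, single-threaded list

        foreach (var item in source)         // "source" stands in for the on-disk dataset
        {
            if (isMatch(item))
            {
                buffer.Add(item);
                if (buffer.Count == 1000)
                {
                    lock (listLock) { matches.AddRange(buffer); }   // one lock per 1000 matches
                    buffer.Clear();
                }
            }
        }

        if (buffer.Count > 0)
        {
            lock (listLock) { matches.AddRange(buffer); }           // flush the remainder
        }
    }
}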

My question is this: As I understand it, adding items one at a time causes the internal capacity of a List<T> to double every time the capacity is reached, but I don't know how List<T>.AddRange() behaves. I would assume that it only adds enough capacity to accommodate the new items, but I can't find any way to confirm this. The MSDN description of how the capacity is increased is almost identical for Add() and AddRange(), except that for AddRange() it says the capacity is increased if the new count is greater than the capacity, whereas for Add() it is increased if the Count is already the same as the capacity.
To me this reads as if using AddRange() to add enough items to go over the current capacity would cause the capacity to be increased in the same way that going over the current capacity using Add() would.

So, will adding items using List<T>.AddRange() in a chunk large enough to exceed the current capacity cause the capacity to increase only enough to accommodate the new items, or will it cause the capacity to double? Or does it do something else that I've not even considered?

Hopefully this is clear enough, as it is a general question about how List<T> is implemented; the sketch above is only meant to illustrate the batching approach, but if anything is unclear I will add more detail to my question. As mentioned, I've read the MSDN documentation and couldn't find a clear answer. I searched for any similar questions on here as well and couldn't find any, but if there's one I've missed please point me to it!

  • [This question](http://stackoverflow.com/questions/2123161/listt-addrange-implementation-suboptimal?rq=1) may also be relevant if you're concerned about performance – Mgetz Sep 02 '13 at 12:24

3 Answers

7

As long as the collection passed as the AddRange() parameter implements ICollection<T>, the backing array is resized at most once:

ICollection<T> collection2 = collection as ICollection<T>;
if (collection2 != null)
{
    int count = collection2.Count;
    if (count > 0)
    {
        this.EnsureCapacity(this._size + count);

        // (...)

otherwise the collection is enumerated and Insert is called for each element:

}
else
{
    using (IEnumerator<T> enumerator = collection.GetEnumerator())
    {
        while (enumerator.MoveNext())
        {
            this.Insert(index++, enumerator.Current);
        }
    }
}
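
In other words, which path is taken depends on the runtime type of the argument you pass. A quick illustration (my own example, not part of the decompiled source):

using System;
using System.Collections.Generic;
using System.Linq;

class AddRangePaths
{
    static void Main()
    {
        var target = new List<int>();

        int[] array = { 1, 2, 3 };
        // Arrays implement ICollection<T>, so this takes the single EnsureCapacity path.
        target.AddRange(array);

        IEnumerable<int> lazy = array.Where(x => x > 1);
        // LINQ iterators don't implement ICollection<T>, so this enumerates and inserts item by item.
        target.AddRange(lazy);

        Console.WriteLine(target.Count);   // 5
    }
}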

Edit

Look at the EnsureCapacity method:

private void EnsureCapacity(int min)
{
    if (this._items.Length < min)
    {
        int num = (this._items.Length == 0) ? 4 : (this._items.Length * 2);
        if (num > 2146435071)
        {
            num = 2146435071;
        }
        if (num < min)
        {
            num = min;
        }
        this.Capacity = num;
    }
}

It increases the array size to Max(old_size * 2, min), and because it is called with min = old_size + count, the final array size after an AddRange() call will be Max(old_size * 2, old_size + count) - so it will vary depending on the current List<T> size and the size of the collection that is added using the AddRange() method.
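
If you want to verify that behaviour yourself, a small test along these lines should show both cases (the exact numbers assume the growth logic above and may vary between framework versions):

using System;
using System.Collections.Generic;

class CapacityDemo
{
    static void Main()
    {
        // List with Count == Capacity == 1000.
        var small = new List<int>(new int[1000]);
        small.AddRange(new int[1]);
        Console.WriteLine(small.Capacity);   // expected 2000: old_size * 2 wins over old_size + count

        // Same starting point, but the added range is larger than the current size.
        var big = new List<int>(new int[1000]);
        big.AddRange(new int[5000]);
        Console.WriteLine(big.Capacity);     // expected 6000: old_size + count wins over old_size * 2
    }
}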

MarcinJuraszek
  • I assume this is copypasta from the reference source? – Mgetz Sep 02 '13 at 12:22
  • @Mgetz I used ILSpy to look inside mscorlib.dll, but I don't think the source will differ. – MarcinJuraszek Sep 02 '13 at 12:23
  • Thanks for your answer. I'm aware that it only increases the array size once, I want to know how much it increases the array by: if it only increases it by the size of the collection being added or if it doubles it like it does when exceeding the current capacity with Add(). Sorry if that wasn't clear! Judging from the EnsureCapacity line, it looks like it only increases the capacity to be just big enough to contain the added items? – phoebelmurphy Sep 02 '13 at 12:24
  • @phoebelmurphy my comment above was as much curiosity as it was a hint that you should take a look at the reference source or use ILSpy like Marcin did that should answer your question. – Mgetz Sep 02 '13 at 12:25
  • @phoebelmurphy I've updated my answer. Hope it's clear enough now. – MarcinJuraszek Sep 02 '13 at 12:29
3

The capacity is increased in the same way as with Add. This is not explicitly mentioned in the documentation, but a look at the source code shows that both Add and AddRange internally use EnsureCapacity.
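
For reference, the decompiled Add looks roughly like this (taken from a decompiler as well, so details may differ slightly between framework versions):

public void Add(T item)
{
    if (this._size == this._items.Length)
    {
        // Same growth logic as AddRange: grow to at least _size + 1, doubling when needed.
        this.EnsureCapacity(this._size + 1);
    }
    this._items[this._size++] = item;
    this._version++;
}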

Daniel Hilgarth
0

AddRange will only increase the size to the necessary amount. So in the AddRange function you could find something like the following code:

if(capacity < count + items.Count)
{
  capacity = count + items.Count;
}

Edit: Turns out the items might be added one by one.

But if you're working with really large data sets and read performance is important, it's probably better to use a binary tree. That allows faster searching, adding and removing, and partial locking that leaves the rest of the tree usable. The biggest problem with trees is deciding when to rebalance. I used the tree below in my chess game, which is rebalanced after every move (because that's when removals are needed, and removal isn't thread-safe with this implementation):

namespace Chess
{
    /// <summary>
    /// Implements a binary search tree.
    /// Thread-safe when adding, not when removing.
    /// </summary>
    public class BinaryTree
    {
        public MiniMax.Node info;
        public BinaryTree left, right;

        /// <summary>
        /// Collisions are handled by returning the existing node. Thread-safe.
        /// Does not recalculate height; do that after all positions are added.
        /// </summary>
        /// <param name="chessNode">Connector in a tree structure.</param>
        /// <returns>The node the position was already stored in, or null if this is a new node.</returns>
        public MiniMax.Node AddConnection(MiniMax.Node chessNode)
        {
            if (this.info == null)
            {
                lock (this)
                {
                    // Must check again, in case it was changed before lock.
                    if (this.info == null)
                    {
                        this.left = new BinaryTree();
                        this.right = new BinaryTree();
                        this.info = chessNode;
                        return null;
                    }
                }
            }

            int difference = this.info.position.CompareTo(chessNode.position);

            if (difference < 0) return this.left.AddConnection(chessNode);
            else if (difference > 0) return this.right.AddConnection(chessNode);
            else
            {
                this.info.IncreaseReferenceCount();
                return this.info;
            }
        }

        /// <summary>
        /// Construct a new Binary search tree from an array.
        /// </summary>
        /// <param name="elements">Array of elements, inorder.</param>
        /// <param name="min">First element of this branch.</param>
        /// <param name="max">Last element of this branch.</param>
        public void CreateFromArray(MiniMax.Node[] elements, int min, int max)
        {
            if (max >= min)
            {
                int mid = (min + max) >> 1;
                this.info = elements[mid];

                this.left = new BinaryTree();
                this.right = new BinaryTree();

                // The left and right subtrees each get one half of the array, except the mid element.
                this.left.CreateFromArray(elements, min, mid - 1);
                this.right.CreateFromArray(elements, mid + 1, max);
            }
        }

        public void CollectUnmarked(MiniMax.Node[] restructure, ref int index)
        {
            if (this.info != null)
            {
                this.left.CollectUnmarked(restructure, ref index);

                // Nodes marked for removal will not be added to the array.
                if (!this.info.Marked)
                    restructure[index++] = this.info;

                this.right.CollectUnmarked(restructure, ref index);
            }
        }

        public int Unmark()
        {
            if (this.info != null)
            {
                this.info.Marked = false;
                return this.left.Unmark() + this.right.Unmark() + 1;
            }
            else return 0;
        }
    }
}
MrFox
  • Your description about what `AddRange` does to the capacity is simply incorrect. – Daniel Hilgarth Sep 02 '13 at 12:34
  • There is no `capacity = count + items.Count` within the `AddRange` method. – MarcinJuraszek Sep 02 '13 at 12:34
  • I didn't mean that exact code, just that somehow it reserves enough space for all items that need to be copied. – MrFox Sep 02 '13 at 12:39
  • @MrFox: As you can see from the other answers, especially Marcin's, this is far from accurate. In fact, the way it is formulated, it is not just inaccurate, but wrong. The capacity can be increased a lot more than just to the size that would be necessary to contain the new items. – Daniel Hilgarth Sep 02 '13 at 12:42
  • My data is already sorted, I am just going through and picking out specific pieces of it. I access items in the list by their index and always add them at the end. Please correct me if I'm wrong, but I believe that in this case there isn't any read or write advantage to a binary tree? Isn't the main advantage the sorting of unsorted data? – phoebelmurphy Sep 02 '13 at 16:34