13

If I have, say, 100 items that'll be stored in a dictionary, should I initialise it thus?

var myDictionary = new Dictionary<Key, Value>(100);

My understanding is that the .NET dictionary internally resizes itself when it reaches a given loading, and that the loading threshold is defined as a ratio of the capacity.

That would suggest that if 100 items were added to the above dictionary, it would resize itself while one of those 100 items was being added. Resizing a dictionary is something I'd like to avoid, as it has a performance hit and is wasteful of memory.

The probability of hashing collisions is proportional to the loading in a dictionary. Therefore, even if the dictionary does not resize itself (and uses all of its slots), performance must degrade due to these collisions.

How should one best decide what capacity to initialise the dictionary to, assuming you know how many items will be inside the dictionary?

Drew Noakes

6 Answers

6

What you should initialize the dictionary capacity to depends on two factors: (1) the distribution of your GetHashCode function, and (2) how many items you have to insert.

Your hash function should either be randomly distributed, or it should be specially formulated for your set of inputs. Let's assume the first; if you are interested in the second, look up perfect hash functions.

If you have 100 items to insert into the dictionary, a randomly distributed hash function, and a capacity of 100, then when you insert the i-th item into the hash table you have at most an (i - 1)/100 probability that it will collide with another item upon insertion. If you want to lower this probability of collision, increase the capacity: doubling the expected capacity roughly halves the chance of collision.
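
As a rough illustration (a sketch added to this answer, not part of the original), the following simulation inserts 100 randomly-hashed items into bucket arrays of various sizes and counts how often an insertion lands in an already-occupied bucket; the bucket counts, seed, and trial count are arbitrary choices:

using System;

static class CollisionSimulation
{
    static void Main()
    {
        var rng = new Random(12345);              // fixed seed so runs are repeatable
        const int items = 100, trials = 10_000;

        foreach (int buckets in new[] { 100, 200, 400 })
        {
            long collisions = 0;
            for (int t = 0; t < trials; t++)
            {
                var occupied = new bool[buckets];
                for (int i = 0; i < items; i++)
                {
                    int slot = rng.Next(buckets); // stands in for a uniformly distributed hash
                    if (occupied[slot]) collisions++;
                    else occupied[slot] = true;
                }
            }
            Console.WriteLine($"{buckets,4} buckets: {(double)collisions / (trials * items):P1} of insertions collided");
        }
    }
}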

Furthermore, if you know how frequently you are going to be accessing each item in the dictionary you may want to insert the items in order of decreasing frequency since the items that you insert first will be on average faster to access.

hhawk
  • wow, inserting frequently used items before seldom used items to increase performance. I never thought of that. – rocketsarefast Jun 26 '12 at 14:01
  • Is there a *requirement* that the physical hash buckets actually *align* to the capacity specified? I would imagine it is free to pick a suitable bucket count as long as it conforms to "The capacity of a Dictionary is the number of elements that can be added to the Dictionary before resizing is necessary." –  Aug 10 '12 at 19:29
6

Improved benchmark:

  • Hardware: Intel Core i7-10700K, x64, optimized build. LINQPad 6 for the .NET 5 run and LINQPad 5 for the .NET Framework 4.8 run.
  • Times are in fractional milliseconds to 3 decimal places.
    • 0.001ms is 1 microsecond.
    • I am unsure of the actual resolution of Stopwatch as it's system-dependent, so don't stress over differences at the microsecond level.
  • Benchmark was re-run dozens of times with consistent results. Times shown are averages of all runs.
  • Conclusion: Consistent 10-20% overall speedup by setting capacity in the Dictionary<String,String> constructor.

                            .NET Framework 4.8    .NET 5
With initial capacity of 1,000,000:
    Constructor             1.170ms               0.003ms
    Fill in loop            353.420ms             181.846ms
    Total time              354.590ms             181.880ms
Without initial capacity:
    Constructor             0.001ms               0.001ms
    Fill in loop            400.158ms             228.687ms
    Total time              400.159ms             228.688ms
Speedup from setting initial capacity:
    Time saved              45.569ms              46.808ms
    Speedup %               11%                   20%
  • I did repeat the benchmark for smaller initial sizes (10, 100, 1000, 10000, and 100000) and the 10-20% speedup was also observed at those sizes, but in absolute terms a 20% speedup on an operation that takes a fraction of a millisecond is unlikely to matter.
  • I saw consistent results (the numbers shown are averages), but there are some caveats:
    • This benchmark used a rather extreme size of 1,000,000 items in a tight loop (i.e. not much else going on inside the loop body), which is not a realistic scenario. So always profile and benchmark your own code to know for sure, rather than trusting a random benchmark you found on the Internet (like this one).
    • The benchmark doesn't isolate the time spent generating the million or so String instances (caused by i.ToString()).
    • A reference-type (String) was used for both keys and values, so each key and value slot is the size of a native pointer (8 bytes on x64); results will differ if the keys and/or values are a larger value-type (such as a ValueTuple). There are other type-size factors to consider as well.
    • As things improved drastically from .NET Framework 4.8 to .NET 5, you shouldn't trust these numbers if you're running on .NET 6 or later.
      • Also, don't assume that newer .NET releases will *always* make things faster: there have been times when performance actually worsened with both .NET updates and OS security patches.
// When compiled outside LINQPad, this needs:
//   using System; using System.Collections.Generic;
//   using System.Diagnostics; using System.Runtime.CompilerServices;

// Warmup, so JIT compilation doesn't skew the first timed run:
{
    var foo1 = new Dictionary<string, string>();
    var foo2 = new Dictionary<string, string>( capacity: 10_000 );
    foo1.Add( "foo", "bar" );
    foo2.Add( "foo", "bar" );
}


Stopwatch sw = Stopwatch.StartNew();

// Pre-set capacity:
TimeSpan pp_initTime;
TimeSpan pp_populateTime;
{
    var dict1 = new Dictionary<string, string>(1000000);

    pp_initTime = sw.GetElapsedAndRestart();

    for (int i = 0; i < 1000000; i++)
    {
        dict1.Add(i.ToString(), i.ToString());
    }
}
pp_populateTime = sw.GetElapsedAndRestart();

// Default capacity:
TimeSpan empty_initTime;
TimeSpan empty_populateTime;
{
    var dict2 = new Dictionary<string, string>();

    empty_initTime = sw.GetElapsedAndRestart();

    for (int i = 0; i < 1000000; i++)
    {
        dict2.Add(i.ToString(), i.ToString());
    }
}
empty_populateTime = sw.GetElapsedAndRestart();

// Report results:

Console.WriteLine("Pre-set capacity. Init time: {0:N3}ms, Fill time: {1:N3}ms, Total time: {2:N3}ms.", pp_initTime.TotalMilliseconds, pp_populateTime.TotalMilliseconds, ( pp_initTime + pp_populateTime ).TotalMilliseconds );
Console.WriteLine("Empty capacity. Init time: {0:N3}ms, Fill time: {1:N3}ms, Total time: {2:N3}ms.", empty_initTime.TotalMilliseconds, empty_populateTime.TotalMilliseconds, ( empty_initTime + empty_populateTime ).TotalMilliseconds );

// Extension methods (when compiling outside LINQPad, place this in a static class):

// Note: MethodImplOptions.AggressiveOptimization requires .NET Core 3.0 or later;
// omit it when targeting .NET Framework 4.8.
[MethodImpl( MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization )]
public static TimeSpan GetElapsedAndRestart( this Stopwatch stopwatch )
{
    TimeSpan elapsed = stopwatch.Elapsed;
    stopwatch.Restart();
    return elapsed;
}

Original benchmark:

This is the original version of the benchmark, without a cold-start warmup phase and using lower-precision DateTime timing:

  • With capacity (dict1) total time is 1220.778ms (for construction and population).
  • Without capacity (dict2) total time is 1502.490ms (for construction and population).
  • So setting a capacity saved roughly 282ms (~19%) compared to not setting one.
static void Main(string[] args)
{
    const int ONE_MILLION = 1000000;

    DateTime start1 = DateTime.Now;
    
    {
        var dict1 = new Dictionary<string, string>( capacity: ONE_MILLION  );

        for (int i = 0; i < ONE_MILLION; i++)
        {
            dict1.Add(i.ToString(), i.ToString());
        }
    }
        
    DateTime stop1 = DateTime.Now;
        
    DateTime start2 = DateTime.Now;

    {
        var dict2 = new Dictionary<string, string>();

        for (int i = 0; i < ONE_MILLION; i++)
        {
            dict2.Add(i.ToString(), i.ToString());
        }
    }
        
    DateTime stop2 = DateTime.Now;
        
    Console.WriteLine("Time with size initialized: " + (stop1.Subtract(start1)) + "\nTime without size initialized: " + (stop2.Subtract(start2)));
    Console.ReadLine();
}
Dai
jhunter
  • Interesting. For future reference, you should use the System.Diagnostics.Stopwatch class when measuring times such as these. DateTime.Now will only give you 15ms resolution, but Stopwatch gives something like 0.01ms resolution. – Drew Noakes Jan 05 '09 at 19:45
  • What I want to know is whether specifying a size of, say 2,000,000 and adding 1,000,000 is faster due to the reduced loading and therefore reduced chaining. – Drew Noakes Jan 05 '09 at 19:47
  • Ditto on using System.Diagnostics.Stopwatch as opposed to DateTime.Now – Mitch Wheat Jan 06 '09 at 09:47
  • I get more distinct numbers by adding an initial cold-start-warmup step (due to JITing of different members in Dictionary based on scenario). I've edited the answer rather than posting my own due to my contribution being an improvement to an existing answer instead of adding to noise. – Dai Dec 14 '21 at 02:41
5

I think you're over-complicating matters. If you know how many items will be in your dictionary, then by all means specify that on construction. This will help the dictionary to allocate the necessary space in its internal data structures to avoid reallocating and reshuffling data.

Kent Boogaart
  • @StingyJack: not necessarily. For implementation reasons, the dictionary class does not double its storage. Rather, space is allocated to accommodate a prime number of elements because this makes collisions through modulus much rarer. – Konrad Rudolph Jan 05 '09 at 19:11
  • I agree Kent. I should have tagged this question as 'academic'. Dictionaries are key (pun intentional) programming constructs and I like nutting out the trivia on such everyday things as this. My primary question is: does allocating *extra* space reduce collisions and increase performance? – Drew Noakes Jan 05 '09 at 19:58
2

Specifying the initial capacity to the Dictionary constructor increases performance because there will be fewer resizes of the internal structures that store the dictionary values during ADD operations.

If you specify an initial capacity of k to the Dictionary constructor, then:

  1. The Dictionary will reserve the amount of memory necessary to store k elements;
  2. QUERY performance against the dictionary is not affected and it will not be faster or slower;
  3. ADD operations will not require more memory allocations (which can be expensive) and thus will be faster.

From MSDN:

The capacity of a Dictionary<TKey, TValue> is the number of elements that can be added to the Dictionary<TKey, TValue> before resizing is necessary. As elements are added to a Dictionary<TKey, TValue>, the capacity is automatically increased as required by reallocating the internal array.

If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the Dictionary<TKey, TValue>.
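
On newer runtimes there is also Dictionary<TKey,TValue>.EnsureCapacity (.NET Core 2.1+ / .NET 5+, not available on .NET Framework), which grows the internal arrays in a single step when the expected count only becomes known after construction. A minimal sketch, added here for completeness (the expectedCount value is made up for illustration):

using System;
using System.Collections.Generic;

class EnsureCapacityExample
{
    static void Main()
    {
        // Count known up front: pass it to the constructor.
        var known = new Dictionary<int, string>(capacity: 100);

        // Count learned later: grow the internal arrays once, ahead of the inserts.
        var learned = new Dictionary<int, string>();
        int expectedCount = 100;                 // e.g. read from a file header
        learned.EnsureCapacity(expectedCount);   // sizes buckets/entries in one reallocation

        for (int i = 0; i < expectedCount; i++)
            learned.Add(i, i.ToString());

        Console.WriteLine(learned.Count);        // 100
    }
}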

Jorge Ferreira
  • I agree with the documentation :) Still, what I want to know is whether giving *extra* size will reduce the number of collision resolutions and hence improve performance at the cost of some additional memory wastage. – Drew Noakes Jan 05 '09 at 19:54
  • If you are talking about performance of QUERIES against the dictionary no, it will not be faster. The initial capacity k will reserve the amount of memory necessary to store k elements. ADD operations will not require more memory allocations (perhaps expensive) and thus will be faster. – Jorge Ferreira Jan 06 '09 at 09:24
  • @smink, I don't quite agree with you here. The dictionary's lookup process looks in a 'bucket' based upon the hashcode. Multiple entries might prefer that bucket, but the first to be added gets it. Others are chained, meaning that lookup for those others is not as efficient as for the first. – Drew Noakes Jan 06 '09 at 09:54
  • @smink, furthermore, having a larger initial dictionary size would reduce the number of hashing collisions and therefore reduce the average chain length, improving lookup speeds (though potentially marginally). – Drew Noakes Jan 06 '09 at 09:55
1

Yes. Unlike Hashtable, which uses rehashing (probing for another free slot) to resolve collisions, Dictionary uses chaining. So yes, it's good to use the count. For a Hashtable you probably want count * (1/fillfactor).
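
As a rough worked example of that last point (a sketch added here, not Mehrdad's code): Hashtable's default effective load factor is about 0.72, so count * (1/fillfactor) works out to roughly 139 slots for 100 items, whereas Dictionary can simply take the count itself:

using System;
using System.Collections;
using System.Collections.Generic;

class CapacityChoice
{
    static void Main()
    {
        const int expectedItems = 100;

        // Dictionary resolves collisions by chaining, so the item count is a fine capacity.
        var dict = new Dictionary<string, int>(expectedItems);

        // Hashtable probes for a free slot on collision, so it needs headroom:
        // capacity = count * (1 / loadFactor).
        const float loadFactor = 0.72f;  // Hashtable's default effective load factor
        int capacity = (int)Math.Ceiling(expectedItems * (1 / loadFactor));  // 139
        var table = new Hashtable(capacity);

        Console.WriteLine($"Dictionary capacity: {expectedItems}, Hashtable capacity: {capacity}");
    }
}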

Mehrdad Afshari
  • The distinction between rehashing and chaining is an interesting one to note. Thanks. In either case though, there's still some kind of collision resolution taking place that's going to have *some* impact on performance. Are you saying that this is less when chaining? – Drew Noakes Jan 05 '09 at 19:49
  • It's related to the average length of a chain which in turn is related to number of collisions. – Mitch Wheat Jan 06 '09 at 09:49
  • Nope, I'm not saying it's less. It depends. But when you use chaining, the storage space used by the links is not counted in the hash table itself, which reduces the need for more space if a collision takes place. – Mehrdad Afshari Jan 06 '09 at 10:38
-1

The initial size is just a suggestion. For example, most hash tables like to have sizes that are prime numbers or powers of 2.
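
For illustration only (a sketch added here, not Jonathan's code), this is roughly the kind of next-prime lookup a prime-sized hash table performs on the capacity you pass in; .NET's internal HashHelpers.GetPrime (mentioned in the comment below) consults a precomputed table of primes before falling back to a search like this:

using System;

static class NextPrime
{
    // Naive trial-division primality test; fine for illustration.
    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; (long)d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Smallest prime >= min: the "suggested" capacity becomes the next prime up.
    static int GetPrime(int min)
    {
        for (int i = min; ; i++)
            if (IsPrime(i)) return i;
    }

    static void Main()
    {
        Console.WriteLine(GetPrime(100));    // 101
        Console.WriteLine(GetPrime(1000));   // 1009
    }
}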

Jonathan Allen
  • A hashtable with a power of 2 size? Does it perform well? – Mehrdad Afshari Jan 05 '09 at 19:20
  • Primes sound better than powers of 2 to me. The .NET framework (mscorlib.dll v2.0.0.0) calls the internal method HashHelpers.GetPrime to find the next largest prime number after 'capacity'. It searches a cache of primes and performs a brute force search if the capacity is larger than 7,199,369 :) – Drew Noakes Jan 05 '09 at 19:53