14

Given this simple piece of code and an array of 10 million random numbers:

static int Main(string[] args)
    {
        int size = 10000000;
        int num =  10; //increase num to reduce number of buckets
        int numOfBuckets = size/num;
        int[] ar = new int[size];
        Random r = new Random(); //initialize with random numbers
        for (int i = 0; i < size; i++)
            ar[i] = r.Next(size);

        var s = new Stopwatch();
        s.Start();
        var group = ar.GroupBy(i => i / num);
        var l = group.Count();
        s.Stop();

        Console.WriteLine(s.ElapsedMilliseconds);
        Console.ReadLine();
        return 0;
    }

I did some performance tests on grouping: when the number of buckets is 10k the estimated execution time is 0.7s, for 100k buckets it is 2s, and for 1m buckets it is 7.5s.

I wonder why that is. I imagine that if GroupBy is implemented using a hash table, there might be a problem with collisions. For example, initially the hash table is prepared to work for, let's say, 1000 groups, and then when the number of groups grows it needs to increase its size and do the rehashing. If that were the case, I could write my own grouping where I would initialize the hash table with the expected number of buckets; I did that, but it was only slightly faster.
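
For illustration, a minimal sketch of what such a pre-sized grouping could look like (illustrative only, not my exact code; the method name is made up):

static Dictionary<int, List<int>> GroupPresized(int[] ar, int num, int expectedBuckets)
{
    // Pre-size the dictionary so no rehashing should be needed while grouping.
    var groups = new Dictionary<int, List<int>>(expectedBuckets);
    foreach (var item in ar)
    {
        int key = item / num;
        List<int> list;
        if (!groups.TryGetValue(key, out list))
        {
            list = new List<int>();
            groups.Add(key, list);
        }
        list.Add(item);
    }
    return groups;
}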

So my question is: why does the number of buckets influence GroupBy performance that much?

EDIT: running under release mode changes the results to 0.55s, 1.6s, and 6.5s respectively.

I also changed the group.ToArray to the piece of code below, just to force execution of the grouping:

foreach (var g in group)
    array[g.Key] = 1;  

where array is initialized before the timer with the appropriate size; the results stayed almost the same.

EDIT2: You can see the working code from mellamokb here: pastebin.com/tJUYUhGL

rank1
  • With the expression `i / numOfBuckets`, you actually get *more* buckets as `numOfBuckets` gets smaller. Did you maybe mean `i % numOfBuckets`? – mellamokb Apr 02 '14 at 14:34
  • I added comments to the code to clarify. – rank1 Apr 02 '14 at 14:43
  • Show us the code you're testing, including the timing code. In addition, tell us how you're running the test. Unless you're timing a release build without the debugger attached, your timings will be highly suspect. – Jim Mischel Apr 02 '14 at 14:43
  • Added release timings, and timing code. – rank1 Apr 02 '14 at 15:12
  • I tested this and got the following timings: 10 buckets=329.9 ms, 100 buckets=352.1 ms, 1000 buckets=475.2 ms, 10000 buckets=628.2 ms, 100000 buckets=1659.6 ms. My timings trend in the opposite direction of yours. Here is my test code: http://pastebin.com/tJUYUhGL – mellamokb Apr 02 '14 at 15:44
  • I just realized I had written an incorrect number of buckets. So everything looks the same; try a test for 1 mln buckets – rank1 Apr 02 '14 at 15:59
  • So you got exactly the same results as me: 628 ms instead of 550, and 1659 instead of 1600, and you will get around 6000 ms for 1m buckets. Sorry for the number-of-buckets confusion at the beginning; however, the following logic was for the correct number of buckets. – rank1 Apr 02 '14 at 16:10
  • You need to show us your complete test program, including the code that you use for timing (i.e. `Stopwatch`). I should be able to copy your method into my test program and run it. Your discussion of "number of buckets" was confusing to start with, and your "correction" just made it more confusing. I *think* I know what your question is, but it's hard to be sure because your terminology is inconsistent. If you want help figuring this out, you need to provide a [SSCCE](http://www.sscce.org/), and make sure that the text of your question is consistent with the example. – Jim Mischel Apr 02 '14 at 19:52
  • Code from mellamokb http://pastebin.com/tJUYUhGL is almost the same as I have, and he received the same timing. I think everything is clear now. – rank1 Apr 02 '14 at 20:00
  • First problem with your timing: you're timing how long it takes to dump everything to an array. That's extra overhead you don't need. Just call `Count()` to iterate over everything. – Jon Skeet Apr 09 '14 at 18:08
  • It would really help if your code could be complete - something we could just copy into a new file, compile and run - not LINQPad or anything like that, just plain C#. I'm getting there, but it would be easier if it was just ready... – Jon Skeet Apr 09 '14 at 18:57
  • @Jon Skeet - I changed to use Count() - results stayed almost the same. I pasted full Main function into question – rank1 Apr 10 '14 at 09:19
  • @rank1: Right. I've done a bit of investigation into this - including removing LINQ entirely - and I strongly suspect this is cache based. I don't have anything like a "full answer" but I'd be happy to post my observations so far as an answer if you'd be interested. – Jon Skeet Apr 10 '14 at 09:28

5 Answers

16

I'm pretty certain this is showing the effects of memory locality (various levels of caching) and also object allocation.

To verify this, I took three steps:

  • Improve the benchmarking to avoid unnecessary parts and to garbage collect between tests
  • Remove the LINQ part by populating a Dictionary (which is effectively what GroupBy does behind the scenes)
  • Remove even Dictionary<,> and show the same trend for plain arrays.

In order to show this for arrays, I needed to increase the input size, but it does show the same kind of growth.

Here's a short but complete program which can be used to test both the dictionary and the array side - just flip which line is commented out in the middle:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Test
{    
    const int Size = 100000000;
    const int Iterations = 3;
    
    static void Main()
    {
        int[] input = new int[Size];
        // Use the same seed for repeatability
        var rng = new Random(0);
        for (int i = 0; i < Size; i++)
        {
            input[i] = rng.Next(Size);
        }
        
        for (int buckets = 10; buckets <= Size; buckets *= 10)
        {
            TimeSpan total = TimeSpan.Zero;
            for (int i = 0; i < Iterations; i++)
            {
                // Switch which line is commented to change the test
                // total += PopulateDictionary(input, buckets);
                total += PopulateArray(input, buckets);
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
            Console.WriteLine("{0,9}: {1,7}ms", buckets, (long) total.TotalMilliseconds);
        }
    }
    
    static TimeSpan PopulateDictionary(int[] input, int buckets)
    {
        int divisor = input.Length / buckets;
        var dictionary = new Dictionary<int, int>(buckets);
        var stopwatch = Stopwatch.StartNew();
        foreach (var item in input)
        {
            int key = item / divisor;
            int count;
            dictionary.TryGetValue(key, out count);
            count++;
            dictionary[key] = count;
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    static TimeSpan PopulateArray(int[] input, int buckets)
    {
        int[] output = new int[buckets];
        int divisor = input.Length / buckets;
        var stopwatch = Stopwatch.StartNew();
        foreach (var item in input)
        {
            int key = item / divisor;
            output[key]++;
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}

Results on my machine:

PopulateDictionary:

       10:   10500ms
      100:   10556ms
     1000:   10557ms
    10000:   11303ms
   100000:   15262ms
  1000000:   54037ms
 10000000:   64236ms // Why is this slower? See later.
100000000:   56753ms 

PopulateArray:

       10:    1298ms
      100:    1287ms
     1000:    1290ms
    10000:    1286ms
   100000:    1357ms
  1000000:    2717ms
 10000000:    5940ms
100000000:    7870ms

An earlier version of PopulateDictionary used an Int32Holder class, and created one for each bucket (when the lookup in the dictionary failed). This was faster when there was a small number of buckets (presumably because we were only going through the dictionary lookup path once per iteration instead of twice) but got significantly slower, and ended up running out of memory. This would contribute to fragmented memory access as well, of course. Note that PopulateDictionary specifies the capacity to start with, to avoid effects of data copying within the test.
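
For reference, a rough reconstruction of that earlier variant; the Int32Holder shape and the method name are guesses based on the description above, not Jon's original code:

class Int32Holder
{
    public int Value;
}

static TimeSpan PopulateDictionaryWithHolders(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    var dictionary = new Dictionary<int, Int32Holder>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        Int32Holder holder;
        if (!dictionary.TryGetValue(key, out holder))
        {
            // One object allocation per bucket - this is what eventually
            // ran out of memory for very large bucket counts.
            holder = new Int32Holder();
            dictionary[key] = holder;
        }
        // Mutate the holder in place: the common case needs only one lookup.
        holder.Value++;
    }
    stopwatch.Stop();
    return stopwatch.Elapsed;
}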

The aim of using the PopulateArray method is to remove as much framework code as possible, leaving less to the imagination. I haven't yet tried using an array of a custom struct (with various different struct sizes) but that may be something you'd like to try too.
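
If you want to try the struct experiment, here is a sketch of what it could look like; the struct name and padding fields are just assumptions used to vary the per-bucket payload size:

struct Bucket16 // 16 bytes per bucket; add or remove padding fields to try other sizes
{
    public int Count;
    public int Pad1, Pad2, Pad3;
}

static TimeSpan PopulateStructArray(int[] input, int buckets)
{
    var output = new Bucket16[buckets];
    int divisor = input.Length / buckets;
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        // Same access pattern as PopulateArray, but each bucket is wider,
        // so fewer buckets fit in a cache line.
        output[item / divisor].Count++;
    }
    stopwatch.Stop();
    return stopwatch.Elapsed;
}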

EDIT: I can reproduce the oddity of the slower result for 10000000 than 100000000 at will, regardless of test ordering. I don't understand why yet. It may well be specific to the exact processor and cache I'm using...

--EDIT--

The reason why 10000000 is slower than the 100000000 result has to do with the way hashing works. A few more tests explain this.

First off, let's look at the operations. There's Dictionary.FindEntry, which is used by the [] indexer and by Dictionary.TryGetValue, and there's Dictionary.Insert, which is used by the [] indexer and by Dictionary.Add. If we just did a FindEntry, the timings would go up as we expect:

static TimeSpan PopulateDictionary1(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    var dictionary = new Dictionary<int, int>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        int count;
        dictionary.TryGetValue(key, out count);
    }
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

This implementation doesn't have to deal with hash collisions (because there are none), which makes the behavior as we expect it. Once we start dealing with collisions, the timings start to suffer. If we have as many buckets as elements, there are obviously fewer collisions... To be exact, we can figure out exactly how many collisions there are by doing:

static TimeSpan PopulateDictionary(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    int c1, c2;
    c1 = c2 = 0;
    var dictionary = new Dictionary<int, int>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        int count;
        if (!dictionary.TryGetValue(key, out count))
        {
            dictionary.Add(key, 1);
            ++c1;
        }
        else
        {
            count++;
            dictionary[key] = count;
            ++c2;
        }
    }
    stopwatch.Stop();
    Console.WriteLine("{0}:{1}", c1, c2);
    return stopwatch.Elapsed;
}

The result is something like this:

10:99999990
       10:    4683ms
100:99999900
      100:    4946ms
1000:99999000
     1000:    4732ms
10000:99990000
    10000:    4964ms
100000:99900000
   100000:    7033ms
1000000:99000000
  1000000:   22038ms
9999538:90000462       <<-
 10000000:   26104ms
63196841:36803159      <<-
100000000:   25045ms

Note the value of '36803159'. This answers the question of why the last result is faster than the one before it: it simply has to do fewer operations -- and since caching fails anyway, that factor doesn't make a difference anymore.

Martin Jespersen
Jon Skeet
  • Perhaps it's simply faster because Intel Speedstep kicked in, boosting the frequency of a single CPU core. I've noticed the same behavior during some of my own tests. – atlaste Apr 11 '14 at 08:35
  • @StefandeBruijn: While that's possible, I suspect it's unlikely in this case - I'd been running lots of tests by this time, so I'd have expected Speedstep to have kicked in a lot earlier if it was going to. I'll see whether I can reproduce it though - it's interesting. – Jon Skeet Apr 11 '14 at 10:11
  • just tested it on my FixBucketTest implementation; I cannot reproduce your results here. I do use different dictionary calls, but afaik that shouldn't make a difference since they all end up with the same FindEntry/Insert calls. Still, it is interesting, so if you find something please let me know. – atlaste Apr 11 '14 at 11:35
  • @StefandeBruijn: Can you try with the exact code I've posted? I can reproduce the timings of those three slow tests regardless of order - the one with 10000000 buckets is always ~65 seconds, and the ones with 1000000 and 100000000 buckets are always ~55 seconds. It's very odd. – Jon Skeet Apr 11 '14 at 17:55
  • Very weird. Just tested with your exact code; on my pc it just goes up while using the .NET 4 x64 JIT'ter. (Intel i5, 8 GB mem). Timings: 64s, 77s, 109s. The .NET 4.5 x64 JIT'ter gives quite different timings though: 71s, 101s, 107s. I've noticed a few times before that the 4.5 JIT'ter does strange things though (and is usually slower); can you confirm with the 4.0 JIT'ter? – atlaste Apr 12 '14 at 12:15
  • @StefandeBruijn: Certainly, if you'd give me some instructions as to switching JIT. (I'm used to switching between x86 and x64, but not switching JITs within that...) – Jon Skeet Apr 12 '14 at 12:17
  • Jon, I'll simply assume you have 4 installed as well. Go to properties of your console application, tab 'Application', select target framework '.NET Framework 4'. Check that it's in Release mode. Recompile, start without debugging. Should do the trick. – atlaste Apr 12 '14 at 12:33
  • Just tested it on my server; it gives the same behavior as your results, regardless of the .NET runtime. The plot thickens... I have a few hunches that I'm going to test (like the dictionary randomizer and the way you force a GC). If I find something, I'll let you know. – atlaste Apr 12 '14 at 12:55
  • @StefandeBruijn: You're assuming I've got a VS project - I haven't :) Obviously I can create one. But I was under the impression that 4.5 was installed "on top of" 4.0... – Jon Skeet Apr 12 '14 at 13:10
  • Jon, doesn't matter, I figured it out. :-) I'll add it to the post. – atlaste Apr 12 '14 at 13:15
  • @StefandeBruijn: Thanks. So we're still reckoning that caches are responsible for the overall time increase? – Jon Skeet Apr 12 '14 at 13:44
  • Jon: Yes, definitely. – atlaste Apr 12 '14 at 13:48
  • Thank you guys. Both answers earned acceptance; however, I could accept only one. – rank1 Apr 14 '14 at 10:00
6

When the number of buckets is 10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.

This is an important pattern to recognize when you profile code. It is one of the standard size-vs-execution-time relationships in software algorithms. Just from seeing the behavior, you can tell a lot about the way the algorithm was implemented. And the other way around, of course: from the algorithm you can predict the expected execution time. A relationship that's annotated in Big-O notation.

The speediest code you can get is amortized O(1); execution time barely increases when you double the size of the problem. The Dictionary<> class behaves that way, as Jon demonstrated. The increase in time as the problem set gets large is the "amortized" part, a side-effect of Dictionary having to perform linear O(n) searches in buckets that keep getting bigger.

A very common pattern is O(n). That tells you that there is a single for() loop in the algorithm that iterates over the collection. O(n^2) tells you there are two nested for() loops. O(n^3) has three, etcetera.

What you got is the one in between, O(log n). It is the standard complexity of a divide-and-conquer algorithm. In other words, each pass splits the problem in two, continuing with the smaller set. Very common, you see it in sorting algorithms. Binary search is the one you find in your text book. Note how log₂(10) = 3.3, very close to the increment you see in your test. Performance starts to tank a bit for very large sets due to the poor locality of reference, a CPU cache problem that's always associated with O(log n) algorithms.
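
For illustration only, a textbook binary search - included to make the O(log n) shape concrete, not as something GroupBy itself does:

// Each pass halves the remaining range, so the number of passes is log2(n).
static int BinarySearch(int[] sorted, int target)
{
    int lo = 0, hi = sorted.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == target) return mid;
        if (sorted[mid] < target) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1; // not found
}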

The one thing that Jon's answer demonstrates is that his guess cannot be correct: GroupBy() certainly does not use a Dictionary<>. And it is not possible by design, since Dictionary<> cannot provide an ordered collection, whereas GroupBy() must be ordered. The MSDN Library says so:

The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.

Not having to maintain order is what makes Dictionary<> fast. Keeping order always costs O(log n), a binary tree in your text book.

Long story short, if you don't actually care about order, and you surely would not for random numbers, then you don't want to use GroupBy(). You want to use a Dictionary<>.
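
A sketch of the kind of Dictionary<> based counting being suggested here, assuming all you actually need is a count per bucket (the method name is illustrative):

static Dictionary<int, int> CountBuckets(int[] ar, int num)
{
    var counts = new Dictionary<int, int>();
    foreach (var item in ar)
    {
        int key = item / num;
        int count;
        counts.TryGetValue(key, out count); // leaves count at 0 for a new key
        counts[key] = count + 1;
    }
    return counts; // no ordering guarantee on the keys
}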

Hans Passant
  • So the one thing that no one noticed is that GroupBy must return values ordered, which is a great point! However, sorting even 1mln random integer elements takes less than 100ms, so it is quite small compared to the few-seconds increase that we are seeing. So is it the size of the elements that makes sorting so slow? (and again, something probably to do with caching?) – rank1 Apr 15 '14 at 13:44
  • I'm not going to make a judgement based on a benchmark I cannot see. You are definitely too unappreciative of O(log n) complexity. It is the good kind to have. You make the problem a thousand times bigger, the program is only ten times slower. That's pretty doggone good, not a lot of algorithms can do that. Certainly not sorting, it is amortized O(n x log n). – Hans Passant Apr 15 '14 at 13:54
  • @HansPassant You're wrong, groupings use hash-based matching. They don't have to be sorted, they just have to be stored in the group in the same order as they are added. The amortized time of a grouping is definitely O(1). – atlaste Apr 16 '14 at 07:30
  • An attempt to explain how it works: Imagine having a hash table (int[]) with a separate array of the key-values (keyvaluepair[]). You can enumerate a dictionary by enumerating the hash, or enumerating the array the key/value pairs are stored in. As long as elements to the array are added in the same order as they appear in the source, the condition holds. Add is amortized O(1) because it requires an insert in the hash and in the keyvalue array, which are both O(1). Binary trees or sorting has nothing to do with it. – atlaste Apr 16 '14 at 07:34
  • The need to keep order is definitely poorly understood. .NET doesn't have a collection class that keeps order but still provides less than O(n) lookup, might have something to do with it. SortedList is closest with O(log n) inserts and O(log n) lookup but it re-orders. There's a good reason for .NET not having such a collection, it performs poorly when the input is not random enough. – Hans Passant Apr 16 '14 at 10:43
  • @HansPassant I understand keeping order perfectly, thank you :-) You're just mistaking the concept of "keeping the keys in order of insertion" for "sorting by key". The former is used by grouping (or if you like `Lookup.GetGrouping`), while you're talking about the latter. Simply put: group by **does not sort nor does it keep the keys ordered by any comparison operation**, it just appends a key to a list iff the key is new. If it's not clear, I'd be happy to add some code that illustrates how it works. – atlaste Apr 22 '14 at 07:04
3

There are (at least) two influencing factors: First, a hash table lookup only takes O(1) if you have a perfect hash function, which does not exist. Thus, you have hash collisions.

I guess more important, though, are caching effects. Modern CPUs have large caches, so for the smaller bucket counts, the hash table itself might fit into the cache. As the hash table is frequently accessed, this might have a strong influence on the performance. If there are more buckets, more accesses to RAM might be necessary, which are slow compared to a cache hit.
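
To put rough numbers on that, here is a back-of-the-envelope estimate; the per-entry sizes are assumptions based on a typical Dictionary<int,int> layout, not measured values:

static long ApproxHashTableBytes(int buckets)
{
    const int bucketSlotBytes = 4; // one int per slot in the buckets array
    const int entryBytes = 16;     // hashCode + next + key + value per entry
    return (long)buckets * (bucketSlotBytes + entryBytes);
}
// ApproxHashTableBytes(10000)   -> ~200 KB, can live in the CPU cache
// ApproxHashTableBytes(1000000) -> ~20 MB, mostly served from RAM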

Georg
3

There are a few factors at work here.

Hashes and groupings

The way grouping works is by creating a hash table. Each individual group then supports an 'add' operation, which appends an element to that group's list. To put it bluntly, it's like a Dictionary<Key, List<Value>>.

Hash tables are always overallocated. If you add an element to the hash, it checks whether there is enough capacity, and if not, recreates the hash table with a larger capacity (to be exact: new capacity = count * 2, with count the number of groups). However, a larger capacity means that the bucket index is no longer correct, which means you have to re-build the entries in the hash table. The Resize() method in Lookup<Key, Value> does this.

The 'groups' themselves work like a List<T>. These too are overallocated, but are easier to reallocate. To be precise: the data is simply copied (with Array.Copy in Array.Resize) and a new element is added. Since there's no re-hashing or calculation involved, this is quite a fast operation.

The initial capacity of a grouping is 7. This means that for 10 groups you need to reallocate 1 time, for 100 groups 4 times, for 1000 groups 8 times, and so on. Because you have to re-hash more elements each time, your code gets a bit slower each time the number of buckets grows.
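
A quick sanity check of those reallocation counts (a simplified model that just doubles a capacity starting at 7; the real Lookup resize logic differs in the details):

static int ResizeCount(int groups)
{
    int capacity = 7, resizes = 0;
    while (capacity < groups)
    {
        capacity *= 2; // simplified: the real code picks a new size near count * 2
        resizes++;
    }
    return resizes;
}
// ResizeCount(10)   == 1
// ResizeCount(100)  == 4
// ResizeCount(1000) == 8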

I think these overallocations are the largest contributors to the small growth in the timings as the number of buckets grows. The easiest way to test this theory is to do no overallocations at all (test 1), and simply put counters in an array. The result is shown below in the code for FixArrayTest (or, if you like, FixBucketTest, which is closer to how groupings work). As you can see, the timings for # buckets = 10...10000 are the same, which is correct according to this theory.

Cache and random

Caching and random number generators aren't friends.

Our little test also shows that when the number of buckets grows above a certain threshold, memory comes into play. On my computer this happens at an array size of roughly 4 MB (4 * number of buckets). Because the data is random, random chunks of RAM will be loaded and unloaded into the cache, which is a slow process. This also explains the large jump in the speed. To see this in action, change the random numbers to a sequence (called 'test 2'), and - because the data pages can now be cached - the speed will remain the same overall.

Note that hashes overallocate, so you will hit the mark before you have a million entries in your grouping.

Test code

static void Main(string[] args)
{
    int size = 10000000;
    int[] ar = new int[size];

    //random number init with numbers [0,size-1]
    var r = new Random();
    for (var i = 0; i < size; i++)
    {
        ar[i] = r.Next(0, size);
        //ar[i] = i; // Test 2 -> uncomment to see the effects of caching more clearly
    }

    Console.WriteLine("Fixed dictionary:");
    for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
    {
        var num = (size / numBuckets);
        var timing = 0L;
        for (var i = 0; i < 5; i++)
        {
            timing += FixBucketTest(ar, num);
            //timing += FixArrayTest(ar, num); // test 1
        }
        var avg = ((float)timing) / 5.0f;

        Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
    }

    Console.WriteLine("Fixed array:");
    for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
    {
        var num = (size / numBuckets);
        var timing = 0L;
        for (var i = 0; i < 5; i++)
        {
            timing += FixArrayTest(ar, num); // test 1
        }
        var avg = ((float)timing) / 5.0f;

        Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
    }
}

static long FixBucketTest(int[] ar, int num)
{
    // This test shows that timings will not grow for the smaller numbers of buckets if you don't have to re-allocate
    System.Diagnostics.Stopwatch s = new Stopwatch();
    s.Start();
    var grouping = new Dictionary<int, List<int>>(ar.Length / num + 1); // exactly the right size
    foreach (var item in ar)
    {
        int idx = item / num;
        List<int> ll;
        if (!grouping.TryGetValue(idx, out ll))
        {
            grouping.Add(idx, ll = new List<int>());
        }
        //ll.Add(item); //-> this would complete a 'grouper'; however, we don't want the overallocator of List to kick in
    }
    s.Stop();
    return s.ElapsedMilliseconds;
}

// Test with arrays
static long FixArrayTest(int[] ar, int num)
{
    System.Diagnostics.Stopwatch s = new Stopwatch();
    s.Start();

    int[] buf = new int[(ar.Length / num + 1) * 10];
    foreach (var item in ar)
    {
        int code = (item & 0x7FFFFFFF) % buf.Length;
        buf[code]++;
    }

    s.Stop();
    return s.ElapsedMilliseconds;
}
atlaste
0

When executing bigger calculations, less physical memory is available on the computer. Counting the buckets will be slower with less memory; as you expand the buckets, your free memory will decrease.

Try something like the following:

int size = 2500000; //10000000 divided by 4
int num = 10;       //bucket width, as in the question
int[] ar = new int[size];

//random number init with numbers [0,size-1]
var r = new Random();
for (int i = 0; i < size; i++)
    ar[i] = r.Next(size);

System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();

for (int i = 0; i < 4; i++)
{
    var group = ar.GroupBy(x => x / num);
    //the number of expected buckets is size / num.
    var l = group.ToArray();
}

s.Stop();

calculating 4 times with lower numbers.

Nick Prozee