
I have a list of 500000 randomly generated Tuple<long,long,string> objects on which I am performing a simple "between" search:

var data = new List<Tuple<long,long,string>>(500000);
...
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);

When I generate my random array and run my search for 100 randomly generated values of x, the searches complete in about four seconds. Knowing of the great wonders that sorting does to searching, however, I decided to sort my data - first by Item1, then by Item2, and finally by Item3 - before running my 100 searches. I expected the sorted version to perform a little faster because of branch prediction: my thinking has been that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would predict the branch correctly as "no take", speeding up the tail portion of the search. Much to my surprise, the searches took twice as long on a sorted array!

I tried switching around the order in which I ran my experiments, and used a different seed for the random number generator, but the effect has been the same: searches in an unsorted array ran nearly twice as fast as the searches in the same array, but sorted!

Does anyone have a good explanation of this strange effect? The source code of my tests follows; I am using .NET 4.0.


private const int TotalCount = 500000;
private const int TotalQueries = 100;
private static long NextLong(Random r) {
    var data = new byte[8];
    r.NextBytes(data);
    return BitConverter.ToInt64(data, 0);
}
private class TupleComparer : IComparer<Tuple<long,long,string>> {
    public int Compare(Tuple<long,long,string> x, Tuple<long,long,string> y) {
        var res = x.Item1.CompareTo(y.Item1);
        if (res != 0) return res;
        res = x.Item2.CompareTo(y.Item2);
        return (res != 0) ? res : String.CompareOrdinal(x.Item3, y.Item3);
    }
}
static void Test(bool doSort) {
    var data = new List<Tuple<long,long,string>>(TotalCount);
    var random = new Random(1000000007);
    var sw = new Stopwatch();
    sw.Start();
    for (var i = 0 ; i != TotalCount ; i++) {
        var a = NextLong(random);
        var b = NextLong(random);
        if (a > b) {
            var tmp = a;
            a = b;
            b = tmp;
        }
        var s = string.Format("{0}-{1}", a, b);
        data.Add(Tuple.Create(a, b, s));
    }
    sw.Stop();
    if (doSort) {
        data.Sort(new TupleComparer());
    }
    Console.WriteLine("Populated in {0}", sw.Elapsed);
    sw.Reset();
    var total = 0L;
    sw.Start();
    for (var i = 0 ; i != TotalQueries ; i++) {
        var x = NextLong(random);
        var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
        total += cnt;
    }
    sw.Stop();
    Console.WriteLine("Found {0} matches in {1} ({2})", total, sw.Elapsed, doSort ? "Sorted" : "Unsorted");
}
static void Main() {
    Test(false);
    Test(true);
    Test(false);
    Test(true);
}

Populated in 00:00:01.3176257
Found 15614281 matches in 00:00:04.2463478 (Unsorted)
Populated in 00:00:01.3345087
Found 15614281 matches in 00:00:08.5393730 (Sorted)
Populated in 00:00:01.3665681
Found 15614281 matches in 00:00:04.1796578 (Unsorted)
Populated in 00:00:01.3326378
Found 15614281 matches in 00:00:08.6027886 (Sorted)
Michiel
Sergey Kalinichenko
    Because of branch prediction :p – Soner Gönül Dec 24 '12 at 17:12
    @jalf I expected the sorted version to perform a little faster because of branch prediction. My thinking was that once we get to the point where `Item1 == x`, all further checks of `t.Item1 <= x` would predict the branch correctly as "no take", speeding up the tail portion of the search. Obviously, that line of thinking has been proven wrong by the harsh reality :) – Sergey Kalinichenko Dec 24 '12 at 17:20
  • Interestingly, for `TotalCount` around `10,000` or less, the sorted version does perform faster (of course, trivially faster at those small numbers) (FYI, your code might want to have the initial size of `var data = new List<Tuple<long,long,string>>(500000)` bound against `TotalCount` instead of hard-coding the capacity) – Chris Sinclair Dec 24 '12 at 17:37
    @ChrisSinclair good observation! I have added an explanation in my answer. – usr Dec 24 '12 at 17:43
  • I'd like to add that the slowdown is specifically connected to *filtering* the list. Performing `data.Where()` shows the same slowdown, as does anything else that iterates over the sorted list. Operating on the sorted and unsorted lists without any filter takes the same time. – Bobson Dec 24 '12 at 17:43
  • While it's a little out of the scope of the question of "why", it may be worth noting that the biggest advantage to pre-sorting the list should be that you can use BinarySearch() on it and achieve O(log n) performance on your searches. – Mark Peters Dec 24 '12 at 19:37
    **This question is _NOT_ a duplicate** of an existing question here. **Do not vote to close it as one.** – ThiefMaster Dec 25 '12 at 20:56
  • a contradiction http://stackoverflow.com/q/11227809/992665 – Sar009 Dec 27 '12 at 05:54
    @Sar009 Not at all! The two questions consider two very different scenarios, quite naturally arriving at different results. – Sergey Kalinichenko Dec 27 '12 at 10:58
    Not related to your question, but you create a class `TupleComparer`, but that is entirely unnecessary since `Comparer<Tuple<long,long,string>>.Default` already has this behavior (from the `IComparable` implementation of `Tuple<,,>`). So you can just use `data.Sort()` with no arguments. – Jeppe Stig Nielsen Aug 09 '13 at 20:49
    http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array#comment14829350_11227809 I'm surprised that there a sorted array is faster – puretppc Jan 26 '14 at 16:24
  • Does this answer your question? [Why is processing a sorted array faster than processing an unsorted array?](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array) – Starship - On Strike Jun 15 '23 at 21:17
  • @Starship-OnStrike [no](https://stackoverflow.com/questions/14023988/why-is-processing-a-sorted-array-slower-than-an-unsorted-array?noredirect=1#comment19386465_14023988) – Sergey Kalinichenko Jun 15 '23 at 21:33

2 Answers


When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.

When you are sorting the list you put it into random order because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory and almost every access to a tuple is a cache miss.
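(Editorial illustration, not part of the original answer; the class name `LocalityDemo` is made up.) The effect is easy to reproduce in isolation: sum the same array once through its indices in order, and once through the same indices shuffled. The arithmetic work is identical; only the memory-access pattern changes:

```csharp
using System;
using System.Diagnostics;

class LocalityDemo
{
    // Sums values[order[0]], values[order[1]], ... in the given index order.
    public static long Sum(long[] values, int[] order)
    {
        long s = 0;
        foreach (var idx in order) s += values[idx];
        return s;
    }

    static void Main()
    {
        const int N = 1 << 22;               // 32 MB of longs: far bigger than cache
        var values = new long[N];
        var order = new int[N];
        for (int i = 0; i < N; i++) { values[i] = i; order[i] = i; }

        var sw = Stopwatch.StartNew();
        long seq = Sum(values, order);       // sequential pattern: prefetch-friendly
        Console.WriteLine("sequential: " + sw.Elapsed);

        // Fisher-Yates shuffle of the index array: same elements, random order.
        var rnd = new Random(1);
        for (int i = N - 1; i > 0; i--)
        {
            int j = rnd.Next(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        sw.Restart();
        long shuffled = Sum(values, order);  // random pattern: mostly cache misses
        Console.WriteLine("shuffled:   " + sw.Elapsed);

        Debug.Assert(seq == shuffled);       // identical work, very different timings
    }
}
```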

This is a nice example for a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.

The penalty from cache misses outweighs the saved branch prediction penalty in this case.

Try switching to a struct-tuple. This will restore performance because no pointer-dereference needs to occur at runtime to access tuple members.
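(A minimal sketch of what such a struct-tuple might look like; `LongPair` is a made-up name, not from the original answer.) Because a struct is stored inline in the list's backing array, sorting moves the values themselves, so the scan over `Item1`/`Item2` stays sequential in memory even after sorting:

```csharp
using System;

// Hypothetical value-type replacement for Tuple<long,long,string>.
// The two longs live inline in the List<T>'s backing array; the string
// field is still a reference, but the search never reads it.
struct LongPair : IComparable<LongPair>
{
    public long Item1;
    public long Item2;
    public string Item3;

    // Same ordering as the TupleComparer in the question.
    public int CompareTo(LongPair other)
    {
        var res = Item1.CompareTo(other.Item1);
        if (res != 0) return res;
        res = Item2.CompareTo(other.Item2);
        return res != 0 ? res : string.CompareOrdinal(Item3, other.Item3);
    }
}
```

The query itself is unchanged: `data.Count(t => t.Item1 <= x && t.Item2 >= x)`.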

Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.

This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tells me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles. Given that number, one could expect the program to become >10x slower when introducing random memory accesses.

usr
    Good reason why everything you learn in C/C++ doesn't apply verbatim to a language like C#! – user541686 Dec 24 '12 at 17:48
    You can confirm this behavior by manually copying the sorted data into a `new List<Tuple<long,long,string>>(500000)` one-by-one before testing that new list. In this scenario, the sorted test is just as fast as the unsorted one, which matches with the reasoning on this answer. – Bobson Dec 24 '12 at 17:52
    Excellent, thank you very much! I made an equivalent `Tuple` struct, and the program started behaving the way I predicted: the sorted version was a little faster. Moreover, the unsorted version became twice as fast! So the numbers with `struct` are 2s unsorted vs. 1.9s sorted. – Sergey Kalinichenko Dec 24 '12 at 21:31
  • @Mehrdad: Depending on specifics of allocator the consequent values of memory in C/C++ can also be one after another so it does apply to some extend (and you still can use the same optimization). – Maciej Piechotka Dec 24 '12 at 21:57
    So can we deduce from this that cache-miss hurts more than branch-mispredication? I think so, and always thought so. In C++, `std::vector` almost always performs better than `std::list`. – Nawaz Dec 25 '12 at 05:57
    @Mehrdad: No. This is true for C++ also. Even in C++, compact data structures are fast. Avoiding cache-miss is as important in C++ as in any other language. `std::vector` vs `std::list` is a good example. – Nawaz Dec 25 '12 at 06:00
  • >> So the numbers with struct are 2s unsorted vs. 1.9s sorted << And with a proper algorithm (binary search) the sorted array searches will drop to a few milliseconds. I see this as more proof that the proper algorithm is far more important, and that one should think before writing code that performs hidden loops like this LINQ query – nsimeonov Feb 01 '14 at 23:00
  • @usr, i've always wondered how/where does one learn about internals like this? – Stan R. Mar 06 '15 at 22:59
    @StanR. I subscribe to a lot of blogs and from time to time there is an article about such things. I rarely learn from tutorial-style content or from books. Over time one reads about pretty much everything. – usr Mar 07 '15 at 09:55
  • @usr thanks for the reply, i usually only read Eric Lippert and Jon Skeet blogs, do you have any recommendations for "internals" or "systems programming" type of blogs that are interesting to follow? – Stan R. Mar 09 '15 at 21:19
  • @StanR. I have hundreds in my Feedly and none of them is essential. Just add every good blog that you come across. Following the "blog roll" is sometimes a good idea, too. – usr Mar 10 '15 at 08:07
  • You can add one more point: on systems with a large amount of memory (free RAM >> list size), dynamic allocation will tend to populate the same pages, thereby hiding this latency. – xyz Jul 15 '15 at 13:44

LINQ doesn't know whether your list is sorted or not.

Since `Count` with a predicate parameter is an extension method for all `IEnumerable`s, it doesn't even know whether it's running over a collection with efficient random access. So it simply checks every element, and usr has explained why performance dropped.

To exploit the performance benefits of a sorted array (such as binary search), you'll have to do a little more coding.
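(Editorial sketch, not from the original answer; the names `SortedCount`, `UpperBound`, and `CountBetween` are made up.) Because the list is sorted by `Item1`, every tuple with `Item1 <= x` sits in a prefix of the list. A binary search finds where that prefix ends, and only the prefix needs the `Item2 >= x` check. Note this only prunes the tail; the prefix scan is still linear in the worst case:

```csharp
using System;
using System.Collections.Generic;

static class SortedCount
{
    // Binary search for the index of the first tuple with Item1 > x
    // (an "upper bound" over the Item1-sorted list).
    static int UpperBound(List<Tuple<long,long,string>> data, long x)
    {
        int lo = 0, hi = data.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (data[mid].Item1 <= x) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Counts tuples with Item1 <= x && Item2 >= x, scanning only the
    // prefix where Item1 <= x can hold.
    public static int CountBetween(List<Tuple<long,long,string>> data, long x)
    {
        int end = UpperBound(data, x);   // everything past 'end' has Item1 > x
        int cnt = 0;
        for (int i = 0; i < end; i++)
            if (data[i].Item2 >= x) cnt++;
        return cnt;
    }
}
```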

Emperor Orionii
    I think you misunderstood the question: of course I wasn't hoping that `Count` or `Where` would "somehow" pick up on the idea that my data is sorted, and run a binary search instead of a plain "check everything" search. All I was hoping for was some improvement due to the better branch prediction (see the link inside my question), but as it turns out, locality of reference trumps branch prediction big time. – Sergey Kalinichenko Dec 25 '12 at 16:12