
I ran some code performance tests, and I would like to know how the CPU cache works in this kind of situation:

Here is a classic example of a loop:

        private static readonly short[] _values;

        static MyClass()
        {
            var random = new Random();
            _values = Enumerable.Range(0, 100)
                                .Select(x => (short)random.Next(5000))
                                .ToArray();
        }

        public static void Run()
        {
            short max = 0;
            for (var index = 0; index < _values.Length; index++)
            {
                max = Math.Max(max, _values[index]);
            }
        }

Here is a version that gets the same result, but is much more performant:

        private static readonly short[] _values;

        static MyClass()
        {
            var random = new Random();
            _values = Enumerable.Range(0, 100)
                                .Select(x => (short)random.Next(5000))
                                .ToArray();
        }

        public static void Run()
        {
            short max1 = 0;
            short max2 = 0;
            for (var index = 0; index < _values.Length; index+=2)
            {
                max1 = Math.Max(max1, _values[index]);
                max2 = Math.Max(max2, _values[index + 1]);
            }
            short max = Math.Max(max1, max2);
        }

So I would like to know why the second one is more efficient than the first. I understand it has something to do with the CPU cache, but I don't really get how it happens (the values are not read twice between loops, for example).

EDIT:

.NET Core 2.1.11 (4.6.27617.04), Intel Core i7-7850HQ 2.90 GHz, 64-bit

Calling each 50 million times:

MyClass1: => 00:00:06.0702028

MyClass2: => 00:00:03.8563776 (-36 %)

The second metric is the one with the loop unrolling.
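
(For reference, one way such a comparison could be set up is with BenchmarkDotNet; the sketch below is only illustrative, the wrapper class is made up, and the timings above were gathered differently.)

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    // Hypothetical harness: MyClass1 / MyClass2 stand for the two versions above.
    public class MaxBenchmarks
    {
        [Benchmark(Baseline = true)]
        public void SimpleLoop() => MyClass1.Run();

        [Benchmark]
        public void UnrolledLoop() => MyClass2.Run();
    }

    public static class Program
    {
        public static void Main() => BenchmarkRunner.Run<MaxBenchmarks>();
    }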

Dams
  • Actually it's due to [loop unrolling](https://en.wikipedia.org/wiki/Loop_unrolling), not cache. – Patrick Roberts Jul 19 '19 at 13:16
  • `but much more performant:` How much more performant? How did you test it (what benchmarking library / software did you use)? Running on what spec machine? .NET Core or Framework? 32-bit or 64-bit? Which OS? – mjwills Jul 19 '19 at 13:41
  • The second may throw an exception, depending on the length. Which is one of the reasons why the first _may_ be slower - it is doing better (more frequent) checks. – mjwills Jul 19 '19 at 13:57
  • Added some metrics. I'm still looking into how loop unrolling works. – Dams Jul 19 '19 at 14:14
  • @PatrickRoberts, loop unrolling is also linked to the CPU cache: https://stackoverflow.com/questions/39379650/how-can-a-programs-size-increase-the-rate-of-cache-misses - I understand better now. Thanks :) – Dams Jul 19 '19 at 14:16
  • @Dams it's related, sure, but you claim that caching makes it more efficient. That is not the case. Based on your link, the use of loop unrolling actually makes cache misses _more_ likely, which would _decrease_ performance, but the performance benefits of unrolling the loop here far outweigh the negative effect on caching due to the slight increase in code size, so CPU caching is unrelated to the performance difference benchmarked here. – Patrick Roberts Jul 19 '19 at 15:06

2 Answers

3

The difference in performance in this case is not related to caching - you have just 100 values, and they fit entirely in the L2 cache already at the time you generated them.

The difference is due to out-of-order execution.

A modern CPU has multiple execution units and can perform more than one operation at the same time even in a single-threaded application.

But your loop is problematic for a modern CPU because it has a dependency:

        short max = 0;
        for (var index = 0; index < _values.Length; index++)
        {
            max = Math.Max(max, _values[index]);
        }

Here each subsequent iteration depends on the value of max from the previous one, so the CPU is forced to compute them sequentially.

Your revised loop adds a degree of freedom for the CPU; since max1 and max2 are independent, they can be computed in parallel.

So essentially each iteration of the revised loop can run as fast as an iteration of the first one:

        short max1 = 0;
        short max2 = 0;
        for (var index = 0; index < _values.Length; index+=2)
        {
            max1 = Math.Max(max1, _values[index]);
            max2 = Math.Max(max2, _values[index + 1]);
        }

But it has half the iterations, so in the end you get a significant speedup (not 2x because out-of-order execution is not perfect).
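
(The same idea can be pushed a little further; here is a rough sketch with four independent accumulators, assuming the length is a multiple of four as it is here. The gains flatten out once the execution units are saturated.)

        short max1 = 0, max2 = 0, max3 = 0, max4 = 0;
        for (var index = 0; index < _values.Length; index += 4)
        {
            // Four independent dependency chains the CPU can work on at once.
            max1 = Math.Max(max1, _values[index]);
            max2 = Math.Max(max2, _values[index + 1]);
            max3 = Math.Max(max3, _values[index + 2]);
            max4 = Math.Max(max4, _values[index + 3]);
        }
        short max = Math.Max(Math.Max(max1, max2), Math.Max(max3, max4));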

rustyx
  • This is a very nice answer! I wasn't aware that the C# compiler performed optimizations by executing hot loops across multiple execution units concurrently. When I mentioned loop unrolling, I thought this was more related to the fact that `index < _values.Length` is executed half as often, and that because `max1` and `max2` are calculated independently in each iteration, the CPU pipeline has a little extra time to get to the writeback stage for each assignment before the next iteration happens, so the CPU can spend less time waiting on the pipeline to catch up. Is that not accurate? – Patrick Roberts Jul 19 '19 at 15:16
  • I made a test: I replaced "max2" with "max1" inside the calculation of "max2" to create a dependency, and the gain dropped from about 30% to approximately 10%. So your answer explains most of the performance gain for this kind of code. – Dams Jul 19 '19 at 15:31
  • I guess there is still a loop unrolling optimization somewhere, then. – Dams Jul 19 '19 at 15:32
-1

Caching

Caching in the CPU works by pre-loading the next few lines from memory and storing them in the CPU cache. This may be data, pointers, variable values, etc.
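
(As an aside, cache-line effects themselves can be observed with a rough sketch like the one below, which is unrelated to the question's loops; the exact numbers depend entirely on the machine.)

    using System;
    using System.Diagnostics;

    static class CacheLineSketch
    {
        static void Main()
        {
            var data = new int[16 * 1024 * 1024];   // 64 MB, much larger than the CPU caches

            // Touch every element (sequential, prefetch-friendly).
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < data.Length; i++) data[i]++;
            sw.Stop();
            Console.WriteLine($"stride 1 : {sw.ElapsedMilliseconds} ms");

            // Touch only every 16th element (one int per 64-byte cache line).
            // This does 1/16th of the work, yet often takes a comparable amount of time,
            // because each access still pulls an entire cache line from memory.
            sw.Restart();
            for (int i = 0; i < data.Length; i += 16) data[i]++;
            sw.Stop();
            Console.WriteLine($"stride 16: {sw.ElapsedMilliseconds} ms");
        }
    }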

Code Blocks

Between your two blocks of code, the difference may not appear in the syntax. Try converting your code to IL (the intermediate language for C#, which is executed by the JIT, the just-in-time compiler); see the refs below for tools and resources.
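
(For example, a crude way to compare the IL from code itself is reflection; the sketch below only shows the raw IL bytes of the two Run methods, not the machine code the JIT actually emits.)

    using System;
    using System.Reflection;

    static class IlDump
    {
        static void Main()
        {
            // MyClass1 / MyClass2 are the two classes from the question.
            Dump(typeof(MyClass1));
            Dump(typeof(MyClass2));
        }

        static void Dump(Type type)
        {
            MethodBody body = type.GetMethod("Run").GetMethodBody();
            byte[] il = body.GetILAsByteArray();
            Console.WriteLine($"{type.Name}.Run: {il.Length} IL bytes");
            Console.WriteLine(BitConverter.ToString(il));
        }
    }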

Or just decompile your built/compiled code and check how the compiler "optimized it" when producing the DLL/EXE files, using a decompiler.

Other Performance Optimizations

Refs:

Dean Van Greunen