
Consider the following two functions, ComputeA and ComputeB (ComputeC was added later, in response to the comments below):

using System;
using System.Diagnostics;

namespace BenchmarkLoop
{
    class Program
    {
        private static double[] _dataRow;
        private static double[] _dataCol;

        public static double ComputeA(double[] col, double[] row)
        {
            var rIdx = 0;
            var value = 0.0;

            for (var i = 0; i < col.Length; ++i)
            {
                for (var cIdx = 0; cIdx < col.Length; ++cIdx, ++rIdx)
                    value += col[cIdx] * row[rIdx];
            }

            return value;
        }

        public static double ComputeB(double[] col, double[] row)
        {
            var rIdx = 0;
            var value = 0.0;

            for (var i = 0; i < col.Length; ++i)
            {
                value = 0.0;
                for (var cIdx = 0; cIdx < col.Length; ++cIdx, ++rIdx)
                    value += col[cIdx] * row[rIdx];
            }

            return value;
        }

        public static double ComputeC(double[] col, double[] row)
        {
            var rIdx = 0;
            var value = 0.0;

            for (var i = 0; i < col.Length; ++i)
            {
                var tmp = 0.0;
                for (var cIdx = 0; cIdx < col.Length; ++cIdx, ++rIdx)
                    tmp += col[cIdx] * row[rIdx];
                value += tmp;
            }

            return value;
        }

        static void Main(string[] args)
        {
            _dataRow = new double[2500];
            _dataCol = new double[50];

            var random = new Random();
            for (var i = 0; i < _dataRow.Length; i++)            
                _dataRow[i] = random.NextDouble();
            for (var i = 0; i < _dataCol.Length; i++)
                _dataCol[i] = random.NextDouble();

            var nRuns = 1000000;

            var stopwatch = new Stopwatch();
            stopwatch.Start();
            for (var i = 0; i < nRuns; i++)
                ComputeA(_dataCol, _dataRow);
            stopwatch.Stop();
            var t0 = stopwatch.ElapsedMilliseconds;

            stopwatch.Reset();
            stopwatch.Start();
            for (int i = 0; i < nRuns; i++)
                ComputeC(_dataCol, _dataRow);
            stopwatch.Stop();
            var t1 = stopwatch.ElapsedMilliseconds;

            Console.WriteLine($"Time ComputeA: {t0} - Time ComputeC: {t1}");
            Console.ReadKey();
        }
    }
}

They differ only in the "reset" of the variable value before each run of the inner loop. I've run several different kinds of benchmarks, all with "Optimize code" enabled, in both 32-bit and 64-bit, and with different sizes for the data arrays. ComputeB is consistently around 25% faster. I can reproduce these results with BenchmarkDotNet as well, but I cannot explain them. Any idea?

I also checked the resulting assembly with Intel VTune Amplifier 2019: for both functions the JIT output is exactly the same, apart from the extra instruction that resets value (see screenshot). So at the assembly level there is no magic going on that could make the code faster. Is there any other possible explanation for this effect? And how could I verify it?

And here are the results with BenchmarkDotNet (parameter N is the size of _dataCol; _dataRow is always of size N^2): (screenshot of BenchmarkDotNet results)

And the results comparing ComputeA and ComputeC: (screenshot)

JIT assembly for ComputeA (left) and ComputeC (right): (screenshot)

The diff is quite small: in block 2 the variable tmp is set to 0 (it lives in register xmm1), and in block 6 tmp is added to the returned result value. So, overall, no surprise. Just the runtime is magic ;)
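If the speedup really comes from the CPU rather than the JIT, one candidate culprit is the loop-carried dependency on the floating-point adds into value/tmp. That is purely a hypothesis of mine, but it suggests a probe: split the accumulation across two independent accumulators, which halves the length of the add chain without changing the work. A rough sketch, ported by hand to Java so the effect can also be checked outside the CLR (the method names and the two-accumulator split are my own, not from the original code):

```java
import java.util.Random;

public class DepChainProbe {
    // Same computation as ComputeA: one running sum across all outer iterations.
    static double computeSingle(double[] col, double[] row) {
        int rIdx = 0;
        double value = 0.0;
        for (int i = 0; i < col.length; ++i)
            for (int cIdx = 0; cIdx < col.length; ++cIdx, ++rIdx)
                value += col[cIdx] * row[rIdx];
        return value;
    }

    // Two independent accumulators: halves the floating-point add chain.
    static double computePaired(double[] col, double[] row) {
        int rIdx = 0;
        double a = 0.0, b = 0.0;
        for (int i = 0; i < col.length; ++i) {
            int cIdx = 0;
            for (; cIdx + 1 < col.length; cIdx += 2, rIdx += 2) {
                a += col[cIdx] * row[rIdx];
                b += col[cIdx + 1] * row[rIdx + 1];
            }
            // Odd-length tail, if any.
            for (; cIdx < col.length; ++cIdx, ++rIdx)
                a += col[cIdx] * row[rIdx];
        }
        return a + b;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[] row = new double[2500], col = new double[50];
        for (int i = 0; i < row.length; i++) row[i] = rnd.nextDouble();
        for (int i = 0; i < col.length; i++) col[i] = rnd.nextDouble();

        double s = computeSingle(col, row);
        double p = computePaired(col, row);
        // Results agree up to floating-point reassociation error, so any timing
        // gap between the two would be attributable to the dependency chain,
        // not to different work being done.
        System.out.println(Math.abs(s - p) < 1e-9 ? "match" : "mismatch");
    }
}
```

If the two-accumulator variant shows a similar speedup to ComputeC, that would support the dependency-chain explanation; if not, something else is going on.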

Thomas W.
  • I would expect the outer loop to be completely optimized away in `ComputeB`. – vgru Nov 22 '18 at 13:46
  • 25% is huge. You are measuring something else. The two methods are *not* the same. The first one calculates the sum of *all* iterations; the second returns only the sum of the last iteration. – Panagiotis Kanavos Nov 22 '18 at 13:48
  • @PanagiotisKanavos: Ok, but why is the second computation that much faster? – Thomas W. Nov 22 '18 at 13:59
  • @Groo: The JIT isn't that smart (even a C++ compiler will not do this!), as you can check from the assembler code (see screenshot). – Thomas W. Nov 22 '18 at 14:01
  • @ThomasW. which means the *next* optimizer in line, the CPU, probably realizes it can discard the entire loop. Branch prediction was already available in 8086 chips. You can't compare functions that return *different* results. – Panagiotis Kanavos Nov 22 '18 at 14:05
  • You are measuring the first run. Put the measured parts into a procedure and call it multiple times; the difference would disappear. – Antonín Lejsek Nov 22 '18 at 14:07
  • @AntonínLejsek The OP mentioned he can verify the results with BenchmarkDotNet, although *this* code doesn't use it. – Panagiotis Kanavos Nov 22 '18 at 14:08
  • @ThomasW.: you're right, I was pretty sure that gcc would optimize the outer loop away. – vgru Nov 22 '18 at 15:47
  • @PanagiotisKanavos: You are right that the two functions compute different things. Therefore I added a function `ComputeC` to this example, which reproduces the result of `ComputeA` but makes use of the effect from `ComputeB`. So we are still 25% faster and get the same result. For me this is really interesting and I would like to be able to explain it somehow. – Thomas W. Nov 23 '18 at 13:00
  • @ThomasW. it's not just an argument. When I tried this code with BenchmarkDotNet ComputeB was consistently **slower** for all N values. The operations are so small that you can't make any meaningful measurements. BenchmarkDotNet even posts warnings under the summary that the results have bad distributions – Panagiotis Kanavos Nov 23 '18 at 13:15
  • @ThomasW. btw the code you posted here can't work with BenchmarkDotNet. Post the *actual* code you used. Use the *latest* version of BenchmarkDotNet and include the summary warnings. Finally, if there was such an issue – Panagiotis Kanavos Nov 23 '18 at 13:17
  • I find the benchmark results valid. The N=50 case takes multiple milliseconds. This is easily within the range of being measurable. If you want to be safe, increase _dataRow by 100x. This might change cache behavior, though. Can you diff the machine code for A and C? – usr Nov 23 '18 at 13:28
  • @PanagiotisKanavos: I added also the results of BenchmarkDotNet to compare `ComputeA` and `ComputeC`. They are statistically significant and the run of the benchmark goes through on my system without warning. – Thomas W. Nov 26 '18 at 19:07
  • @ThomasW. no they aren't. First, because you *DIDN'T POST YOUR CODE*. How can anyone reproduce or verify the numbers? Second because the warnings appear *after* the summary table. And finally, because on my machine ComputeB is significantly slower. With warnings. But then I have a *desktop* machine, not a laptop (U) CPU. And finally, with timings at the *nanosecond* level, differences may well be due to timer resolutions or other processes causing skews in the results – Panagiotis Kanavos Nov 27 '18 at 08:20
  • @ThomasW. and finally, you yourself said the assembly is the same except for a single operation. Which means the optimization is performed by the *CPU*. Something CPUs are known to do, and desktop CPUs do better than laptop CPUs, newer CPUs do better than older CPUs. – Panagiotis Kanavos Nov 27 '18 at 08:22

0 Answers