
Tracking down a performance problem (a micro-optimization, I know) I ended up with this test program. Compiled against .NET Framework 4.5 in Release mode, it takes around 10 ms on my machine.

What bothers me is that if I remove this line

public int[] value1 = new int[80];

the time drops to around 2 ms. It seems like there is some memory fragmentation problem, but I have failed to explain why. I have tested the program with .NET Core 2.0 with the same results. Can anyone explain this behaviour?

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace ConsoleApp4
{

    public class MyObject
    {
        public int value = 1;
        public int[] value1 = new int[80];
    }


    class Program
    {
        static void Main(string[] args)
        {

            var list = new List<MyObject>();
            for (int i = 0; i < 500000; i++)
            {
                list.Add(new MyObject());
            }

            long total = 0;
            for (int i = 0; i < 200; i++)
            {
                int counter = 0;
                Stopwatch timer = Stopwatch.StartNew();

                foreach (var obj in list)
                {
                    if (obj.value == 1)
                        counter++;
                }

                timer.Stop();
                total += timer.ElapsedMilliseconds;
            }

            Console.WriteLine(total / 200);

            Console.ReadKey();
        }
    }
}

UPDATE:

After some research I came to the conclusion that it's just processor cache access time. Using the VS profiler, the cache misses appear to be a lot higher when the array field is present:

[Profiler screenshot: cache misses without the array]

[Profiler screenshot: cache misses with the array]

danijepg
  • Side note: `var list = new List<MyObject>(500000);` - let's allocate memory for the list *once* – Dmitry Bychenko Jan 16 '19 at 07:02
  • There are a lot of "surrounding circumstances" (like GC and JIT compiler) that can (drastically) change between runs, so you might want to use a better suited benchmarking tool than just `Stopwatch`. You could use [BenchmarkDotNet](https://benchmarkdotnet.org/) (no affiliation) for example. – Corak Jan 16 '19 at 07:23
  • use a struct and an array to hold them, instant performance boost – TheGeneral Jan 16 '19 at 07:29
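
(A minimal sketch of the struct-plus-array layout TheGeneral suggests; the `MyValue` type and the timing harness are illustrative, not from the thread. Because value types in an array are stored contiguously, the scan touches far fewer cache lines.)

using System;
using System.Diagnostics;

namespace ConsoleApp4
{
    // Hypothetical variant: the data lives inline in a struct,
    // and a plain array keeps all instances contiguous in memory.
    public struct MyValue
    {
        public int value;
    }

    class StructProgram
    {
        static void Main()
        {
            var items = new MyValue[500000];
            for (int i = 0; i < items.Length; i++)
                items[i].value = 1;

            var timer = Stopwatch.StartNew();
            int counter = 0;
            foreach (var item in items)
            {
                if (item.value == 1)
                    counter++;
            }
            timer.Stop();

            Console.WriteLine($"{counter} matches in {timer.ElapsedMilliseconds} ms");
        }
    }
}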

2 Answers


There are several factors involved.

With the line `public int[] value1 = new int[80];`, each `MyObject` requires one extra memory allocation: a new array is created on the heap to hold 80 integers (320 bytes), plus the array object's overhead. You do 500,000 of these allocations.

These allocations add up to more than 160 MB of RAM, which may cause the GC to kick in and check whether memory can be released.

Further, when you allocate that much memory, it is likely that some of the objects in the list are no longer resident in the CPU cache. When you later enumerate the collection, the CPU may have to read the data from RAM instead of from cache, which incurs a serious performance penalty.
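
One way to sanity-check the allocation claim is to compare `GC.GetTotalMemory` before and after building the list. This is only a rough sketch: it assumes the `MyObject` class from the question is in scope, and `GC.GetTotalMemory` gives an approximation, not an exact figure.

using System;
using System.Collections.Generic;

namespace ConsoleApp4
{
    class MemoryCheck
    {
        static void Main()
        {
            long before = GC.GetTotalMemory(forceFullCollection: true);

            // Assumes the MyObject class from the question is in scope.
            var list = new List<MyObject>(500_000);
            for (int i = 0; i < 500_000; i++)
                list.Add(new MyObject());

            long after = GC.GetTotalMemory(forceFullCollection: true);

            // With the int[80] field, expect roughly
            // 500,000 * (320 bytes of data + ~24 bytes of array overhead on 64-bit)
            // on top of the MyObject instances themselves.
            Console.WriteLine($"Allocated: {(after - before) / (1024 * 1024)} MB");
        }
    }
}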

Nick

I'm not able to reproduce a big difference between the two, and I wouldn't expect one either. Below are the results I get on .NET Core 2.2.

Instances of `MyObject` are allocated on the heap. In one case, the instance holds an int and a reference to the int array; in the other, just the int. Either way, the loop has to do the extra work of following the reference from the list to each instance, and that work is the same in both cases, as the compiled code shows.

Branch prediction will affect how fast this runs, but since you're branching on the same condition every time, I wouldn't expect this to change from run to run (unless you change the data).
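
As a side illustration of the branch prediction point (a hypothetical experiment, not part of the benchmark below): the same counting loop gets measurably slower once the branch outcome is random instead of constant.

using System;
using System.Diagnostics;

namespace CoreSandbox
{
    class BranchDemo
    {
        static void Main()
        {
            var rng = new Random(42);
            var predictable = new int[5_000_000];
            var unpredictable = new int[5_000_000];

            for (int i = 0; i < predictable.Length; i++)
            {
                predictable[i] = 1;             // branch always taken
                unpredictable[i] = rng.Next(2); // branch taken ~50% of the time, at random
            }

            Console.WriteLine($"predictable:   {Scan(predictable)} ms");
            Console.WriteLine($"unpredictable: {Scan(unpredictable)} ms");
        }

        static long Scan(int[] data)
        {
            var timer = Stopwatch.StartNew();
            int counter = 0;
            foreach (var v in data)
            {
                if (v == 1)
                    counter++;
            }
            timer.Stop();

            // Print the counter so the loop cannot be optimized away.
            Console.WriteLine($"  counter = {counter}");
            return timer.ElapsedMilliseconds;
        }
    }
}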

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.376 (1803/April2018Update/Redstone4)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.2.200-preview-009648
  [Host]     : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
  DefaultJob : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT


       Method |   size |     Mean |     Error |    StdDev | Ratio |
------------- |------- |---------:|----------:|----------:|------:|
    WithArray | 500000 | 8.167 ms | 0.0495 ms | 0.0463 ms |  1.00 |
 WithoutArray | 500000 | 8.167 ms | 0.0454 ms | 0.0424 ms |  1.00 |

For reference:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;

namespace CoreSandbox
{
    [DisassemblyDiagnoser(printAsm: true, printSource: false, printPrologAndEpilog: true, printIL: false, recursiveDepth: 1)]
    //[MemoryDiagnoser]
    public class Test
    {
        private List<MyObject> dataWithArray;
        private List<MyObjectLight> dataWithoutArray;

        [Params(500_000)]
        public int size;

        public class MyObject
        {
            public int value = 1;
            public int[] value1 = new int[80];
        }

        public class MyObjectLight
        {
            public int value = 1;
        }

        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<Test>();
        }

        [GlobalSetup]
        public void Setup()
        {
            dataWithArray = new List<MyObject>(size);
            dataWithoutArray = new List<MyObjectLight>(size);

            for (var i = 0; i < size; i++)
            {
                dataWithArray.Add(new MyObject());
                dataWithoutArray.Add(new MyObjectLight());
            }
        }

        [Benchmark(Baseline = true)]
        public int WithArray()
        {
            var counter = 0;

            foreach(var obj in dataWithArray)
            {
                if (obj.value == 1)
                    counter++;
            }

            return counter;
        }

        [Benchmark]
        public int WithoutArray()
        {
            var counter = 0;

            foreach (var obj in dataWithoutArray)
            {
                if (obj.value == 1)
                    counter++;
            }

            return counter;
        }

    }
}
Brian Rasmussen
  • I am not sure how the cache influences the benchmarks. The OP has one single run only: he first allocates the memory, then accesses all objects in sequence. That gives a high chance that objects have to be retrieved from RAM, not from the cache. Running the benchmark repeatedly over just the access of the collections dramatically increases the chance of getting the whole collection into the cache, and thus the two methods show pretty equal results. – Nick Jan 16 '19 at 09:11
  • @Nick BenchmarkDotNet goes a long way to reduce random noise from its measurements. The test above measures performance of access without taking whatever happened before into account. – Brian Rasmussen Jan 16 '19 at 09:17
  • @BrianRasmussen I can't check your benchmark right now, I will try later. My program was executed plenty of times on different machines. I think it's just the processor cache. – danijepg Jan 16 '19 at 10:01
  • @BrianRasmussen, there are no grounds to refute my statement. What I describe is not _noise_, it is merely an effect of the specific sequence of execution. In trying to remove all the noise, BenchmarkDotNet, a wonderful tool as it is, creates just a slightly different case. – Nick Jan 16 '19 at 11:43