8

Background

We are currently going through the process of converting our codebase from .Net Framework 4.8 to .Net Core 3.1.

Some of the code is very performance-sensitive. One example is some code that applies a Hamming window filter; I was somewhat dismayed to discover that the .Net Core 3.1-compiled code runs around 30% more slowly than the same code compiled for .Net Framework 4.8.

To reproduce

I created a multitargeted SDK-style project as follows:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net48;netcoreapp3.1</TargetFrameworks>
    <Optimize>true</Optimize>
  </PropertyGroup>
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|AnyCPU'">
    <PlatformTarget>x86</PlatformTarget>
  </PropertyGroup>
</Project>

The code for this project is as follows (the important code is the sum() method and the loop that calls it):

using System;
using System.Diagnostics;

namespace FooBar
{
    class Program
    {
        static void Main()
        {
#if NET48
            Console.WriteLine("NET48: Is 64 bits = " + Environment.Is64BitProcess);
#elif NETCOREAPP3_1
            Console.WriteLine("NETCOREAPP3_1: Is 64 bits = " + Environment.Is64BitProcess);
#else
#error Invalid build, so refuse to compile.
#endif
            double[] array = new double[100_000_000];
            var sw = Stopwatch.StartNew();

            for (int trial = 0; trial < 100; ++trial)
            {
                sum(array);
            }

            Console.WriteLine("Average ms for calls to sum() = " + sw.ElapsedMilliseconds/100);
            Console.ReadLine();
        }

        static double sum(double[] array)
        {
            double s = 0;

            for (int i = 0; i < array.Length; ++i)
            {
                s += array[i];
            }

            return s;
        }
    }
}

Results

Timing a release x86 build for .Net Core 3.1 and .Net Framework 4.8 I get the following results:

.Net Core 3.1:

NETCOREAPP3_1: Is 64 bits = False
Average ms for calls to sum() = 122

.Net Framework 4.8:

NET48: Is 64 bits = False
Average ms for calls to sum() = 96

Thus the .Net Core 3.1 build is roughly 27% slower (122 ms versus 96 ms) than the .Net Framework 4.8 build.

NOTE: This only affects the x86 build. For an x64 build, the times are similar between .Net Framework and .Net Core.

I find this most disappointing, particularly since I thought that .Net Core would be likely to have better optimization ...

Can anyone suggest a way to speed up the .Net Core output so that it is in the same ballpark as .Net Framework 4.8?


[EDIT] I've updated the code and the .csproj to the latest version I'm using for testing. I added some code to indicate which target and platform is running, just to be certain the right version is being run.

With this edit, I am basically just timing how long it takes to sum all 100,000,000 elements of a large double[] array.

I can reproduce this on both my PCs and my laptop, which are running the latest Windows 10 and Visual Studio 2019 installations + latest .Net Core 3.1.

However, given that other people cannot reproduce this, I will take Lex Li's advice and post this on the Microsoft GitHub page.

Matthew Watson
  • With that you can talk to Microsoft guys directly https://github.com/dotnet/runtime/issues – Lex Li Jul 21 '20 at 14:53
  • I'm not sure how accurate your benchmarks were, since you used just raw code and Stopwatch. Can you try BenchmarkDotNet to re-verify this? It could just be a different core speed between the two runs, or something else. BenchmarkDotNet is a benchmarking library that is designed to eliminate such outside factors – Lemm Jul 21 '20 at 14:57
  • I found .Net Core to be much faster when I ported my graphics code that runs https://pixeldatabase.Net, which is built in Blazor. Like more than twice as fast for large images, so I guess it just depends on the workload. –  Jul 21 '20 at 14:59
  • I saw that when I reread it. I edited my comment at the same time you were commenting. –  Jul 21 '20 at 15:01
  • @Lemm I don't think BenchmarkDotNet is necessary for this particular code. I mean, it's 30% different, and it's using multiple trials. – Matthew Watson Jul 21 '20 at 15:05
  • Just out of curiosity, could you try running the same test in x64? – Blindy Jul 21 '20 at 15:07
  • @MatthewWatson I've run your code and observe almost the identical results for both target frameworks and debug/release configuration – Pavel Anikhouski Jul 21 '20 at 15:29
  • @MatthewWatson Maybe too obvious, but is this a debug or release build? – Hasan Emrah Süngü Jul 21 '20 at 15:31
  • @HasanEmrahSüngü As I said in my post, it's a release build. – Matthew Watson Jul 21 '20 at 15:31
  • @Blindy Yes, that's interesting - .Net Framework and .Net Core take about the same time for x64 – Matthew Watson Jul 21 '20 at 16:01
  • @PavelAnikhouski Are you definitely running an x86 build? – Matthew Watson Jul 21 '20 at 16:05
  • @MatthewWatson yes, just double-checked. The difference is pretty small, about 3-5 milliseconds – Pavel Anikhouski Jul 21 '20 at 16:15
  • No repro, I see .NETCore faster. A simple explanation is that you have an older processor that has no AVX2 support yet. Do realize what you're chasing. The inner loop is executed a billion times, which tells us that you have a 3.6 GHz processor and are trying to find a difference of **one** CPU instruction. That's quite hard to do. The job that the jitter optimizer does is critical. As-is the test is invalid; it can't do that job reliably when you mix the testing code with the real code. – Hans Passant Jul 21 '20 at 16:17
  • Fwiw, the most obvious way to make the code fast is to use Math.Abs() to fill window[] so the if-statement is no longer necessary. – Hans Passant Jul 21 '20 at 16:21
  • @HansPassant I originally was using that, and changed it to the inline code to see if that made any difference, but it didn't make any significant difference to the ratio between the two builds. – Matthew Watson Jul 22 '20 at 08:21

2 Answers

4

Cannot reproduce.

Looks like .NET Core 3.1 is faster, at least for x86. I ran each build five or more times and the output was nearly the same every time.

.NET Framework 4.8

Is 64 bits = False
Computed 4199,58 in 00:00:01.2679838
Computed 4199,58 in 00:00:01.1270864
Computed 4199,58 in 00:00:01.1163893
Computed 4199,58 in 00:00:01.1271687

Is 64 bits = True
Computed 4199,58 in 00:00:01.0910610
Computed 4199,58 in 00:00:00.9695353
Computed 4199,58 in 00:00:00.9601170
Computed 4199,58 in 00:00:00.9696420

.NET Core 3.1

Is 64 bits = False
Computed 4199,580000000003 in 00:00:00.9852276
Computed 4199,580000000003 in 00:00:00.9493986
Computed 4199,580000000003 in 00:00:00.9562083
Computed 4199,580000000003 in 00:00:00.9467359

Is 64 bits = True
Computed 4199,580000000003 in 00:00:01.0199652
Computed 4199,580000000003 in 00:00:00.9763987
Computed 4199,580000000003 in 00:00:00.9612935
Computed 4199,580000000003 in 00:00:00.9815544

Updated with new sample

NET48: Is 64 bits = False
Average ms for calls to sum() = 110

NETCOREAPP3_1: Is 64 bits = False
Average ms for calls to sum() = 110

Hardware

Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz

Base speed: 2,40 GHz
Sockets:    1
Cores:  4
Logical processors: 8
Virtualization: Enabled
L1 cache:   256 KB
L2 cache:   1,0 MB
L3 cache:   6,0 MB

Bonus

If the code is that performance-sensitive, SIMD may help.

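using System.Numerics;

// Fragment: assumes `window` (a double[]) and `sw` (a Stopwatch) from the test program.
const int ITERS = 100000;

int vectorSize = Vector<double>.Count;
Console.WriteLine($"Vector size = {vectorSize}");

for (int trial = 0; trial < 4; ++trial)
{
    double windowSum = 0;
    sw.Restart();

    for (int iter = 0; iter < ITERS; ++iter)
    {
        Vector<double> accVector = Vector<double>.Zero;
        int i = 0;

        // Accumulate |x| one whole vector at a time.
        for (; i <= window.Length - vectorSize; i += vectorSize)
        {
            Vector<double> v = new Vector<double>(window, i);
            accVector += Vector.Abs(v);
        }

        // Horizontal add: the dot product with all-ones sums the lanes.
        windowSum = Vector.Dot(accVector, Vector<double>.One);

        // Scalar tail for elements that do not fill a whole vector.
        for (; i < window.Length; ++i)
        {
            windowSum += Math.Abs(window[i]);
        }
    }

    Console.WriteLine($"Computed {windowSum} in {sw.Elapsed}");
}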

Awesomeness of .NET Core is here :)

.NET Core 3.1

Is 64 bits = False
Vector size = 4
Computed 4199,58 in 00:00:00.3678926
Computed 4199,58 in 00:00:00.3046166
Computed 4199,58 in 00:00:00.2910941
Computed 4199,58 in 00:00:00.2900221

Is 64 bits = True
Vector size = 4
Computed 4199,58 in 00:00:00.3446433
Computed 4199,58 in 00:00:00.2616570
Computed 4199,58 in 00:00:00.2606452
Computed 4199,58 in 00:00:00.2582038
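
The same idea can be applied to the sum() loop from the updated question. The following is only a minimal sketch under a few assumptions: the class and method names (SimdSum, ScalarSum, VectorSum) are made up for illustration, and on .Net Framework 4.8 Vector<T> is expected to come from the System.Numerics.Vectors NuGet package rather than being built in. Note that the vectorized version adds the elements in a different order, so for arbitrary data the result may differ from the scalar sum in the last digits.

```csharp
using System;
using System.Numerics;

static class SimdSum
{
    // Plain loop, equivalent to the question's sum().
    public static double ScalarSum(double[] array)
    {
        double s = 0;
        for (int i = 0; i < array.Length; ++i)
        {
            s += array[i];
        }
        return s;
    }

    // Hypothetical vectorized variant: one partial sum per SIMD lane, plus a scalar tail.
    public static double VectorSum(double[] array)
    {
        int width = Vector<double>.Count;
        Vector<double> acc = Vector<double>.Zero;
        int i = 0;

        // Whole vectors.
        for (; i <= array.Length - width; i += width)
        {
            acc += new Vector<double>(array, i);
        }

        // Add the lanes together.
        double s = Vector.Dot(acc, Vector<double>.One);

        // Leftover elements that do not fill a whole vector.
        for (; i < array.Length; ++i)
        {
            s += array[i];
        }
        return s;
    }

    static void Main()
    {
        Console.WriteLine("Hardware accelerated = " + Vector.IsHardwareAccelerated);

        // Small integer values, so both sums are exact and can be compared with ==.
        double[] data = new double[1003];
        for (int i = 0; i < data.Length; ++i)
        {
            data[i] = i;
        }

        Console.WriteLine(ScalarSum(data) == VectorSum(data)); // prints True
    }
}
```

If Vector.IsHardwareAccelerated reports False, the JIT is not mapping Vector<T> to SIMD instructions and the vectorized loop is unlikely to be faster, so it is worth checking that value in both the x86 and x64 builds.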
aepot
  • `Btw, what kind of performance do you expect on a PC that cannot run x64 app?` Who said that the PC cannot run a 64-bit app? Certainly not me... – Matthew Watson Jul 22 '20 at 07:57
  • @MatthewWatson I mean some client's PC which is running 32 bit OS. – aepot Jul 22 '20 at 08:47
  • I guess what I need to find out is why it is OK on some systems but not on others. I've only tried it on two PCs, but I see the difference in speed on both of those. – Matthew Watson Jul 22 '20 at 12:13
  • @MatthewWatson can you provide some hardware information? I've added my hardware info to the answer. – aepot Jul 22 '20 at 12:16
  • @MatthewWatson Task Manager => Performance => Right click on CPU => Copy – aepot Jul 22 '20 at 12:19
  • Sure, one moment. In the meantime, I've just tried it on my laptop (a third test PC for me) and THAT one shows the same times for both! – Matthew Watson Jul 22 '20 at 12:27
  • PC showing a difference: CPU Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz Base speed: 3.41 GHz Sockets: 1 Cores: 4 Logical processors: 8 Virtualisation: Enabled L1 cache: 256 KB L2 cache: 1.0 MB L3 cache: 8.0 MB Utilisation 5% Speed 1.56 GHz Up time 1:18:13:22 Processes 273 Threads 3526 Handles 130907 – Matthew Watson Jul 22 '20 at 12:28
  • @MatthewWatson I'm confused then. I've expected some different CPU. – aepot Jul 22 '20 at 12:29
  • @MatthewWatson try SIMD in .NET Core. It will make the code faster than the fastest .NET Framework 4.8 :) Give it a try. And then you can forget about the initial performance issue. – aepot Jul 22 '20 at 12:34
  • Sorry, I was mistaken - I was testing the 64 bit version on my laptop. When I try the 32-bit version, the difference is still there, so that's 3/3 systems I've tried all show a difference now. As for the SIMD, that could be an option long term, but we have literally half a million lines of code in our codebase, so we'd have to do performance checks to target any fixes! – Matthew Watson Jul 22 '20 at 12:38
  • @MatthewWatson "long term" for a x4 performance boost on all systems, while you're looking for something in a rare x86 case that makes approximately no sense in general (except if this is some gov/mil software). I suggest not spending time on that and trying SIMD first; at least you can compare x86 vs x64 performance for it right now and see whether it inherits the issue or not. You're moving to .NET Core, so why not try its features, e.g. in the few most performance-sensitive lines of code? – aepot Jul 22 '20 at 12:52
  • That's what I meant: We would have to focus on just the areas where improving performance would be most impactful. This is medical diagnostic software, by the way, so we have to be careful! – Matthew Watson Jul 22 '20 at 12:57
0

Well, I gave it a try, and I included .Net5 as well, and as expected they're pretty much identical in performance.

I would take this as a sign to use more rigorous testing methodologies (BenchmarkDotNet), because at this point I'm positive you're not running the correct executable, and BenchmarkDotNet takes care of that for you.

C:\Users\_\source\repos\ConsoleApp3\ConsoleApp3\bin\Release\net48>ConsoleApp3.exe
Computed 4199.58 in 00:00:01.0134120
Computed 4199.58 in 00:00:01.0136130
Computed 4199.58 in 00:00:01.0163664
Computed 4199.58 in 00:00:01.0161655

C:\Users\_\source\repos\ConsoleApp3\ConsoleApp3\bin\Release\net5>ConsoleApp3
Computed 4199.580000000003 in 00:00:01.0269673
Computed 4199.580000000003 in 00:00:01.0214385
Computed 4199.580000000003 in 00:00:01.0295102
Computed 4199.580000000003 in 00:00:01.0241006

C:\Users\_\source\repos\ConsoleApp3\ConsoleApp3\bin\Release\netcoreapp3.1>ConsoleApp3
Computed 4199.580000000003 in 00:00:01.0234075
Computed 4199.580000000003 in 00:00:01.0216327
Computed 4199.580000000003 in 00:00:01.0227448
Computed 4199.580000000003 in 00:00:01.0328213
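
For anyone who wants something stricter than the original Stopwatch loop without pulling in a benchmarking library, here is a rough, dependency-free sketch. It is not a substitute for BenchmarkDotNet; it only adds untimed warm-up calls, takes the median of several timed calls, and prints the runtime that is actually executing. The names TimingHarness and MedianMilliseconds, the warm-up and trial counts, and the array size (smaller than the question's 100,000,000 elements, to keep the sketch light) are all assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class TimingHarness
{
    // Same loop as the question's sum().
    public static double Sum(double[] array)
    {
        double s = 0;
        for (int i = 0; i < array.Length; ++i)
        {
            s += array[i];
        }
        return s;
    }

    // Calls `action` `warmup` times untimed, then `trials` times timed,
    // and returns the median of the timed calls in milliseconds
    // (the upper middle sample when `trials` is even).
    public static double MedianMilliseconds(Func<double> action, int warmup, int trials)
    {
        if (trials < 1) throw new ArgumentOutOfRangeException(nameof(trials));

        double sink = 0; // accumulate results so every call has a visible use

        for (int i = 0; i < warmup; ++i)
        {
            sink += action();
        }

        var samples = new double[trials];
        for (int i = 0; i < trials; ++i)
        {
            var sw = Stopwatch.StartNew();
            sink += action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }

        GC.KeepAlive(sink);
        Array.Sort(samples);
        return samples[trials / 2];
    }

    static void Main()
    {
        // Report which runtime and bitness is really running this executable.
        Console.WriteLine(RuntimeInformation.FrameworkDescription
            + ", 64-bit process = " + Environment.Is64BitProcess);

        double[] array = new double[10_000_000];
        double ms = MedianMilliseconds(() => Sum(array), warmup: 3, trials: 21);
        Console.WriteLine("Median ms per call to Sum() = " + ms.ToString("F2"));
    }
}
```

The median is used instead of the mean so that a single slow call (for example one interrupted by another process) does not skew the figure.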
Blindy
  • Nah, I'm definitely running the correct executable - it puts the output into different folders, one called release/net48 and the other release/netcoreapp3.1. However, I just rechecked 64 bit versus 32 bit and I am seeing the times being the same for x64. Are your results from x64 or x86? – Matthew Watson Jul 21 '20 at 15:27
  • 64 bit, it's the only way .Net's GC works at its fullest. – Blindy Jul 21 '20 at 15:49
  • Well the results are only meaningful for x86, I'm afraid, since that's where the issue appears to be. – Matthew Watson Jul 21 '20 at 15:55
  • You'll run into a lot of issues in 32-bit land with .Net, it's generally regarded as a non-platform. Many of the major optimizations only get applied in 64-bit, as well as the aforementioned efficient GC model. Feel free to post your issue in the .Net JIT repository, but don't expect miracles if you can't use the 64-bit runtime. – Blindy Jul 21 '20 at 16:09
  • `You'll run into a lot of issues in 32-bit land with .Net, it's generally regarded as a non-platform.` Definitely going to have to disagree with that! – Matthew Watson Jul 21 '20 at 16:40
  • You literally just ran into a (possible) regression issue with it, what do you mean you disagree with that? – Blindy Jul 21 '20 at 16:47
  • @Blindy I also disagree that the preferred target for .NET is `generally regarded as a non-platform`. So would most MVPs and the entire MS compiler and runtime teams. When you select `Any CPU` the preferred target is x86. Never mind that ARM is going to be a *lot* more important going forward – Panagiotis Kanavos Jul 21 '20 at 17:28
  • Actually it's not in more modern runtimes like Net Core 3 and Net 5. "Prefer 32-bit" is disabled by default. And please don't speak for others. – Blindy Jul 21 '20 at 17:49
  • @Blindy Well, do be aware that for ARM on Windows, *only* 32-bit builds are available. – Matthew Watson Jul 22 '20 at 08:52