
I have a numerically intensive application and, after looking up GFLOPS figures on the internet, I decided to write my own little benchmark. I ran a single-threaded matrix multiplication thousands of times to get about a second of execution. This is the inner loop:

for (int i = 0; i < SIZEA; i++)
    for (int j = 0; j < SIZEB; j++)
        vector_out[i] = vector_out[i] + vector[j] * matrix[i, j];

It's been years since I dealt with FLOPS, so I expected something around 3 to 6 cycles per FLOP. But I am getting 30 (100 MFLOPS). Surely I will get more if I parallelize this, but I just did not expect that. Could this be a problem with .NET, or is this really the CPU's performance?

Here is a fiddle with the full benchmark code.

EDIT: Run from Visual Studio, even a release build takes longer; the executable run by itself takes 12 cycles per FLOP (250 MFLOPS). Still, is there any VM impact?
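As a rough check on the two figures above: both are consistent with a clock of about 3 GHz. That clock rate is an inference from the numbers, not something measured here, and the relation assumes a single core:

```latex
\[
\text{MFLOPS} \;=\; \frac{f_{\text{clock}}\ [\text{MHz}]}{\text{cycles per FLOP}},
\qquad
\frac{3000}{30} = 100,
\qquad
\frac{3000}{12} = 250 .
\]
```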

Arturo Hernandez
  • Given C# compiles to IL that will ultimately be converted to x86, x64 or various ARM (to name but 3) architectures, there's not going to be a single answer to this. If performance is critical, C# isn't the right tool for (this part of) the job. – Damien_The_Unbeliever Mar 20 '15 at 15:41
  • @Damien_The_Unbeliever's comment is correct (and should be the answer IMO). It will depend on the target architecture. – amura.cxg Mar 20 '15 at 15:48
  • @Damien and amura, that is the question: how much is the architecture and how much may be .NET? In such simple code it may be that we are running at the same speed as the iron. – Arturo Hernandez Mar 20 '15 at 15:54
  • https://msdn.microsoft.com/en-us/library/ms973852.aspx#fastmanagedcode_topic2 – Gary Kaizer Mar 20 '15 at 15:55
  • C# is getting optimized more and more. Your processor probably has processor units specifically for doing multiplications. – MrFox Mar 20 '15 at 15:55
  • This could be a candidate for F#. Curious what the benchmark would be in it. – TyCobb Mar 20 '15 at 15:56
  • Your disappointing results are coming more from `vector_out[i]` guarded by `i < SIZEA`. Learn to use C# properly, this is (converted) C code. – H H Mar 20 '15 at 15:57
  • @TyCobb I was using F# and wanted to check the timing I was getting. – Arturo Hernandez Mar 20 '15 at 17:14
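One reading of the `vector_out[i]` comment above is that the inner loop re-indexes the output array on every iteration and bounds its loops with constants instead of the arrays' own lengths. The sketch below accumulates into a local and uses `Length` for the bounds. The `double` element type, the class name and the method name are assumptions, since the question does not show the declarations, and whether this is actually faster on a given runtime would need to be measured.

```csharp
using System;

static class MatVecSketch
{
    // Hypothetical rewrite of the question's loop: the running sum lives in a
    // local variable and the output array is written once per row.
    public static void Multiply(double[,] matrix, double[] vector, double[] vectorOut)
    {
        for (var i = 0; i < vectorOut.Length; i++)
        {
            var sum = vectorOut[i];
            for (var j = 0; j < vector.Length; j++)
            {
                sum += vector[j] * matrix[i, j];
            }

            vectorOut[i] = sum;
        }
    }

    static void Main()
    {
        var matrix = new double[,] { { 1, 2 }, { 3, 4 } };
        var vector = new double[] { 1, 1 };
        var result = new double[2];
        Multiply(matrix, vector, result);
        Console.WriteLine("{0} {1}", result[0], result[1]); // prints "3 7"
    }
}
```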

1 Answer

Your benchmark doesn't really measure FLOPS; it does some floating point operations and looping in C#.

However, if you can isolate your code to a repetition of just floating point operations you still have some problems.

Your code should include some "pre-cycles" to allow the "jitter to warm-up", so you are not measuring compile time.

Then, even if you do that, you need to compile in release mode with optimizations on and execute your test from the command line on a known, consistent platform.
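The warm-up and "run outside the IDE" advice above can be sketched roughly as follows. `WarmupSketch` and `TimeAfterWarmup` are hypothetical names, not part of any library; `Debugger.IsAttached` is used only as a hint that the run was started under a debugger, which is one plausible reason for slower timings inside Visual Studio.

```csharp
using System;
using System.Diagnostics;

static class WarmupSketch
{
    // Hypothetical helper: runs `work` once untimed so the JIT has compiled it,
    // then returns the elapsed time of a second, measured run.
    public static TimeSpan TimeAfterWarmup(Action work)
    {
        work();                           // warm-up run, timing discarded
        var timer = Stopwatch.StartNew();
        work();                           // measured run
        timer.Stop();
        return timer.Elapsed;
    }

    static void Main()
    {
        if (Debugger.IsAttached)
        {
            Console.WriteLine("Warning: debugger attached, timings may be slower.");
        }

        var data = new double[1000000];
        var elapsed = TimeAfterWarmup(() =>
        {
            for (var i = 0; i < data.Length; i++)
            {
                data[i] += 0.5 * i;
            }
        });
        Console.WriteLine("ms: {0}", elapsed.TotalMilliseconds);
    }
}
```

A single warm-up run is the simplest choice; it does not account for other sources of noise such as garbage collection or other processes on the machine.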


Fiddle here

Here is my alternative benchmark,

using System;
using System.Linq;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        const int Flops = 10000000;
        var random = new Random();
        var output = Enumerable.Range(0, Flops)
                         .Select(i => random.NextDouble())
                         .ToArray();
        var left = Enumerable.Range(0, Flops)
                         .Select(i => random.NextDouble())
                         .ToArray();
        var right = Enumerable.Range(0, Flops)
                         .Select(i => random.NextDouble())
                         .ToArray();

        // Warm-up pass: run the loop once so the JIT compiles it before the timed run.
        var timer = Stopwatch.StartNew();
        for (var i = 0; i < Flops - 1; i++)
        {
            unchecked
            {
                output[i] += left[i] * right[i];
            }
        }

        timer.Stop();
        for (var i = 0; i < Flops - 1; i++)
        {
            output[i] = random.NextDouble();
        }

        // Timed pass: only this run is reported below.
        timer = Stopwatch.StartNew();
        for (var i = 0; i < Flops - 1; i++)
        {
            unchecked
            {
                output[i] += left[i] * right[i];
            }
        }

        timer.Stop();

        Console.WriteLine("ms: {0}", timer.ElapsedMilliseconds);
        // Counts one operation per element; the loop bound skips the last element.
        Console.WriteLine(
            "MFLOPS: {0}",
            (double)Flops / timer.ElapsedMilliseconds / 1000.0);
    }
}

On my VM I get results like

ms: 73
MFLOPS: 136.986301...

Note, I had to increase the number of operations significantly to get over 1 millisecond.
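If whole milliseconds are too coarse for a short run, one alternative is to compute the rate from `Stopwatch.ElapsedTicks` and `Stopwatch.Frequency` (ticks per second). This is a minimal sketch and not the code that produced the numbers above; `RateSketch` and `MegaOpsPerSecond` are hypothetical names, and the sketch skips the warm-up pass for brevity.

```csharp
using System;
using System.Diagnostics;

static class RateSketch
{
    // Hypothetical helper: millions of operations per second, computed from a
    // raw Stopwatch tick count. Stopwatch.Frequency is ticks per second.
    public static double MegaOpsPerSecond(long operations, long elapsedTicks)
    {
        var seconds = (double)elapsedTicks / Stopwatch.Frequency;
        return operations / seconds / 1e6;
    }

    static void Main()
    {
        const int N = 100000;
        var a = new double[N];

        var timer = Stopwatch.StartNew();
        for (var i = 0; i < N; i++)
        {
            a[i] += 1.5 * i;
        }

        timer.Stop();
        Console.WriteLine("MFLOPS: {0}", MegaOpsPerSecond(N, timer.ElapsedTicks));
    }
}
```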

Jodrell
  • I really appreciate your answer. I just want to clarify that like you I only timed the inner loop. – Arturo Hernandez Mar 20 '15 at 16:54
  • @ArturoHernandez in your linked example you call `Stopwatch.StartNew()` which constructs and starts a timer immediately, the subsequent `_timer.Start()` does not reset the timer. You can check the remarks here https://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.start%28v=vs.110%29.aspx – Jodrell Mar 20 '15 at 16:56
  • true but only an issue on the first iteration out of 100000. Once corrected got same answer. I still need to run your code. Tks!!!! – Arturo Hernandez Mar 20 '15 at 17:03
  • since you are using millisecs you got more like 136 mflops. Just like what I got running inside Visual Studio. – Arturo Hernandez Mar 20 '15 at 17:12
  • @ArturoHernandez, you are correct, I've amended accordingly. – Jodrell Mar 20 '15 at 17:19
  • You amended your time, which is now the same as mine. But the negative comments at the top are not yet amended. – Arturo Hernandez Mar 20 '15 at 18:10
  • @ArturoHernandez the negative comments at the top are still valid, your linked benchmark code does not reset the timer so the initial overhead is incorporated into the first iteration. Whether or not we get the same timing is coincidence, we are not running on the same machine. – Jodrell Mar 23 '15 at 09:38
  • Coincidence? Plus, did you read? `Stopwatch _timer = Stopwatch.StartNew(); _timer.Stop();` – Arturo Hernandez Mar 23 '15 at 15:17
  • @ArturoHernandez, okay, you don't measure the initialization. If we get similar results from either benchmark it doesn't really tell us much. Not even that our machines are similar, just that they could be and that they have a similar performance for a given benchmark. No real meaning can be inferred without a much deeper analysis of the IL involved and how the instructions are implemented on a given platform. If you need fast floating point operations in .NET look at something like CUDAfy https://cudafy.codeplex.com/. – Jodrell Mar 23 '15 at 15:40
  • I ran your code on my machine. Thanks for the CUDAfy reference. Looking at http://goo.gl/PgR8fR it is just not clear where the difference is for a single core. Is it .NET, the code, or the machine? This seems to be a difficult question. – Arturo Hernandez Mar 23 '15 at 15:51
  • @Jodrell I just did not want your answer to discourage others from answering. I would have been. – Arturo Hernandez Mar 23 '15 at 15:55
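On the `Stopwatch` point raised in the comments above: `Start()` resumes a stopped timer and keeps the accumulated time, while `Restart()` resets it to zero first. A minimal sketch of the difference, with hypothetical class and method names:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class StopwatchSketch
{
    // Returns true if Start() after Stop() kept the previously elapsed time.
    public static bool StartKeepsElapsed()
    {
        var timer = Stopwatch.StartNew();   // created and already running
        Thread.Sleep(20);
        timer.Stop();
        var first = timer.Elapsed;

        timer.Start();                      // resumes, does not reset
        timer.Stop();
        return timer.Elapsed >= first;
    }

    static void Main()
    {
        Console.WriteLine("Start keeps elapsed time: {0}", StartKeepsElapsed());

        var timer = Stopwatch.StartNew();
        Thread.Sleep(20);
        timer.Restart();                    // resets to zero, then starts again
        timer.Stop();
        Console.WriteLine("After Restart, ms: {0}", timer.Elapsed.TotalMilliseconds);
    }
}
```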