
I'm in the process of implementing different algorithms on CPUs and GPUs. What struck me as odd was that a very primitive example (sequentially, i.e. with a single thread, creating a histogram of an array with 100*1024*1024 elements) takes 200% - 300% longer on a server CPU (which is admittedly slightly lower clocked and one generation older) than it does on a workstation CPU. Both machines use DDR3 memory: 16 GB dual channel on the workstation (FSB:DRAM 1:6) and 512 GB quad channel on the server (FSB:DRAM 1:12), both running at an 800 MHz DRAM clock rate.

On my workstation the histogram calculation takes <100 ms (90 ms on average), while on the server it takes 300 ms on average, though on sporadic occasions it takes only around 150 ms.

I'm using the same build on both machines (Any CPU, Prefer 32-bit, Release build).

As a side question: why is a pure 64-bit build at least 25% slower on both machines?

public static void Main(string[] args) {
    // the array size; say it's 100 * 1024^2, aka 100 megapixels
    const int Size = 100 * 1024 * 1024;

    // define a buffer to hold the random data
    var buffer = new byte[Size];

    // fill the buffer with random bytes
    var rndXorshift = new RndXorshift();
    rndXorshift.NextBytes(buffer);

    // start a stopwatch to time the histogram creation
    var stopWatch = new Stopwatch();
    stopWatch.Start();

    // declare a variable for the histogram
    var histo = new uint[256];

    // for every element of the array ...
    for (int i = 0; i < Size; i++) {
        // increment the histogram at the position
        // of the current array value
        histo[buffer[i]]++;
    }

    // get the histogram count. must be equal
    // to the total elements of the array
    long histoCount = 0;

    for (int i = 0; i < 256; i++) {
        histoCount += histo[i];
    }

    // stop the stopwatch
    stopWatch.Stop();
    var et1 = stopWatch.ElapsedMilliseconds;

    // output the results
    Console.WriteLine("Histogram Sum: {0}", histoCount);
    Console.WriteLine("Elapsed Time1: {0}ms", et1);
    Console.ReadLine();
}
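
Regarding the 64-bit side question, it can help to confirm which bitness the process actually runs with, since "Any CPU" with "Prefer 32-bit" yields a 32-bit process on both machines. Below is a minimal diagnostic sketch (not part of the original program) that could be printed at the top of Main; Environment.Is64BitProcess requires .NET 4.0 or later.

// quick sanity check of the runtime environment before benchmarking
Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
Console.WriteLine("64-bit OS:      {0}", Environment.Is64BitOperatingSystem);
Console.WriteLine("Pointer size:   {0} bytes", IntPtr.Size);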

Server CPU: (screenshot of CPU specs)

Workstation CPU: (screenshot of CPU specs)

lightxx
  • With such a trivial loop the raw MHz factor is pretty relevant (and Turbo Boost will influence results a lot). Note however that your histogram buffer is really small. With a bigger data size, cache size will also determine the result. Finally... don't forget that a benchmark on a (working) server is pretty aleatory. In short: **this test is completely useless unless it's exactly the code you need to measure**; if it's just a test then it's meaningless. – Adriano Repetti Jul 10 '14 at 11:31
  • OK. Even after changing the power plan to High Performance, the server is 33% slower. I guess that's the difference in clock rate and architecture... – lightxx Jul 10 '14 at 11:37
  • @AdrianoRepetti It's not that useless. The actual algorithms in question are much more sophisticated; I just came up with the most simplistic real-world example (calculating a histogram) I could think of. Also, by server I mean a 19" unit that sits around idling in my lab... so it's not that aleatory... BTW, I had never heard that word before :D – lightxx Jul 10 '14 at 11:41
  • I mean: performance is determined by many, many factors: raw GHz, memory speed, processor architecture, instruction set, cache size and speed, and also the number of cores/CPUs. If you're measuring something that isn't your code (unless it's fine-tuned to exactly match it), then the results won't necessarily reflect what you'll get with the real code. That's why standard benchmarks are pretty articulated (and we have different benchmarks). Calculate that histogram with 256 elements, then try again with 1,000,000 elements. Now make it parallel (see the sketch after these comments). Now make the calculation a little bit more complicated... – Adriano Repetti Jul 10 '14 at 11:57
  • ...you'll get very different results. Also, in .NET we have a JIT, and it may use extended instruction sets if available. Move from 32 to 64 bit and you'll also get a completely different JIT implementation. IMO there are too many factors to project them into another context (from fake code to real code). About servers, I mean that if it's a server it probably has its own tasks to perform, so performance will vary (even more) over time because of that. – Adriano Repetti Jul 10 '14 at 12:01
  • thank you very much for your elaborate answer!!! – lightxx Jul 10 '14 at 12:31
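
For reference, here is a minimal sketch of the parallel variant suggested in the comments above, using one thread-local histogram per partition that is merged at the end. System.Random stands in for the question's RndXorshift so the snippet is self-contained, and the manual chunking over Environment.ProcessorCount is an assumption, not something taken from the thread.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ParallelHistogram {
    public static void Main() {
        const int Size = 100 * 1024 * 1024;

        // fill the buffer with pseudo-random bytes
        // (System.Random stands in for the question's RndXorshift)
        var buffer = new byte[Size];
        new Random(42).NextBytes(buffer);

        var stopWatch = Stopwatch.StartNew();

        // one private histogram per worker, merged at the end,
        // so the hot loop never touches shared state
        var histo = new long[256];
        int workers = Environment.ProcessorCount;

        Parallel.For(
            0, workers,
            () => new long[256],                        // thread-local histogram
            (part, state, local) => {
                int chunk = Size / workers;
                int start = part * chunk;
                int end = (part == workers - 1) ? Size : start + chunk;
                for (int i = start; i < end; i++)
                    local[buffer[i]]++;
                return local;
            },
            local => {                                  // merge step
                lock (histo)
                    for (int i = 0; i < 256; i++)
                        histo[i] += local[i];
            });

        stopWatch.Stop();

        long histoCount = 0;
        for (int i = 0; i < 256; i++)
            histoCount += histo[i];

        Console.WriteLine("Histogram Sum: {0}", histoCount);
        Console.WriteLine("Elapsed Time: {0}ms", stopWatch.ElapsedMilliseconds);
    }
}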

1 Answer


The server's CPU clock shows 1177 MHz, while the workstation's shows 3691 MHz. That would explain the difference.

It seems your server either has a CPU that slows down when not under stress to conserve energy, or the multipliers in the BIOS are set to very low values.
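
One way to tell power management apart from genuine architectural differences is to keep the CPU busy for a couple of seconds before starting the Stopwatch, so the clock has a chance to ramp up. A minimal sketch of that idea follows; the two-second duration is an arbitrary assumption, not something from the original post.

using System;
using System.Diagnostics;

public static class WarmUp {
    // spin the current core so frequency scaling / Turbo Boost
    // can ramp the clock up before the real measurement starts
    public static void Spin(TimeSpan duration) {
        var sw = Stopwatch.StartNew();
        long sink = 0;
        while (sw.Elapsed < duration)
            sink += sw.ElapsedTicks & 1;   // cheap busy work kept alive below
        GC.KeepAlive(sink);
    }
}

// usage, right before the timed histogram loop:
// WarmUp.Spin(TimeSpan.FromSeconds(2));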

Dariusz
  • Now that's odd... and you're right. It would almost perfectly explain the difference... weird... – lightxx Jul 10 '14 at 11:29
  • My first thought was the energy saving mode; with a Dell server we were able to double the performance of a Java app that was behaving erratically. So it's time to fiddle with the BIOS settings. – Andreas Jul 10 '14 at 11:32
  • MHz is a factor here because of the trivial loop and small data size (just 256 uints). With a bigger buffer, cache speed and size will also influence the results (a lot). Finally, the CPU model is also important because of the i7's Turbo Boost. – Adriano Repetti Jul 10 '14 at 11:32
  • Stupid Windows for using a "Balanced" power plan on servers by default. Stupid admins for not changing it. Stupid me for not seeing it. – lightxx Jul 10 '14 at 11:35
  • I'm actually surprised that this would depend on core freq, the main bottleneck here should be the sequential access to `buffer` (the histogram is very likely cached completely) - this is a pure memory bandwidth benchmark, and servers usually have the upper hand there. Maybe the 2 extra generations in CPU architecture have some benefit there (perhaps due to HW prefetching?) – Leeor Jul 10 '14 at 21:56