C# is twice as slow as Java in memory access with loops?

Question

I have two pieces of code that are identical in C# and Java, but the Java one runs twice as fast, and I want to know why. Both work on the same principle of using a big lookup table for performance.

Why is the Java version about twice as fast as the C# one?

Java code:

    int h1, h2, h3, h4, h5, h6, h7;
    int u0, u1, u2, u3, u4, u5;
    long time = System.nanoTime();
    long sum = 0;
    for (h1 = 1; h1 < 47; h1++) {
        u0 = handRanksj[53 + h1];
        for (h2 = h1 + 1; h2 < 48; h2++) {
            u1 = handRanksj[u0 + h2];
            for (h3 = h2 + 1; h3 < 49; h3++) {
                u2 = handRanksj[u1 + h3];
                for (h4 = h3 + 1; h4 < 50; h4++) {
                    u3 = handRanksj[u2 + h4];
                    for (h5 = h4 + 1; h5 < 51; h5++) {
                        u4 = handRanksj[u3 + h5];
                        for (h6 = h5 + 1; h6 < 52; h6++) {
                            u5 = handRanksj[u4 + h6];
                            for (h7 = h6 + 1; h7 < 53; h7++) {
                                sum += handRanksj[u5 + h7];
    }}}}}}}
    double rtime = (System.nanoTime() - time)/1e9; // time given is start time
    System.out.println(sum);

It just enumerates all possible 7-card combinations. The C# version is identical except that at the end it uses Console.WriteLine.
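
As a quick sanity check on how much work the nested loops do, here is a minimal sketch that counts the iterations with the same loop bounds and compares the count with the closed form C(52, 7) = 133,784,560. The class and method names (`ComboCount`, `countByLoops`, `binomial`) are made up for this illustration and are not part of the original code.

```java
// Hypothetical helper class, not part of the code in the question.
class ComboCount {

    // Same loop bounds as the benchmark, but only counting iterations
    // instead of reading the lookup table.
    static long countByLoops() {
        long count = 0;
        for (int h1 = 1; h1 < 47; h1++)
            for (int h2 = h1 + 1; h2 < 48; h2++)
                for (int h3 = h2 + 1; h3 < 49; h3++)
                    for (int h4 = h3 + 1; h4 < 50; h4++)
                        for (int h5 = h4 + 1; h5 < 51; h5++)
                            for (int h6 = h5 + 1; h6 < 52; h6++)
                                for (int h7 = h6 + 1; h7 < 53; h7++)
                                    count++;
        return count;
    }

    // C(n, k) built up step by step. After step i the value equals
    // C(n - k + i, i), so every division is exact.
    static long binomial(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countByLoops() + " iterations, C(52,7) = " + binomial(52, 7));
    }
}
```

So each run of the benchmark performs roughly 134 million table lookups in the innermost loop alone.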

The lookup table is defined as:

    static int handRanksj[];

Its size in memory is about 120 Megabytes.

The C# version has the same test code. It's measured with Stopwatch instead of nanoTime() and uses Console.WriteLine instead of System.out.println, but it takes at least twice as long.

Java takes about 400ms. I run Java with the -server flag. In C# the build is set to Release without the DEBUG or TRACE defines.

What is responsible for the speed difference?

Are you only running it once? If so, that will include JIT time. I suggest you run it several times, but ignore the first few. — Jon Skeet, Mar 11 '11 at 17:13
I would try to run this code at least a few hundred times in a row and measure the combined time - this will give a better overview of the performance differences. — BrokenGlass, Mar 11 '11 at 17:14
@Jon, doesn't Java also have something similar to the .NET JIT? — Thomas Levesque, Mar 11 '11 at 17:18
@Thomas: Absolutely, but maybe the .NET JIT is slower. It's an obvious *potential* difference. — Jon Skeet, Mar 11 '11 at 17:19
You could force the JIT using ngen. http://msdn.microsoft.com/en-us/magazine/cc163610.aspx — Justin, Mar 11 '11 at 17:24
I'd guess that Java is just unrolling the inner loop better (I guess unlikely to be register allocation). — Tom Hawtin - tackline, Mar 11 '11 at 17:25
(Server HotSpot will, I believe, concentrate on the excessively hot inner loop. Server HotSpot does a fair amount of profiling, unlike the cut down Client HotSpot.) — Tom Hawtin - tackline, Mar 11 '11 at 17:32
If speed is so important, why don't you try to eliminate some additions? For example, in sum += handRanksj[u5 + h7], the u5 + h7 addition can be removed. Precalculate u5 + 53, change the last loop to for (h7 = h6 + 1 + u5; h7 < (Precalced53PlusU5); h7++), and index with handRanksj[h7]. — xanatos, Mar 11 '11 at 17:32
@xanatos, my understanding of the question wasn't so much "How can I make this run faster", but rather a more theoretical "Why is C# notably slower than Java here", for a simple algorithm that one might expect to compile to almost identical byte-/machine code. — Andrzej Doyle, Mar 11 '11 at 17:35
@Andrzej Yes, it's my understanding too, but he is trying to squeeze the last ms out of 400ms... So everything could help :-) — xanatos, Mar 11 '11 at 17:36
@Jon: after 100 loops the average in C# was 840ms and the fastest of them was 816ms. — michael, Mar 11 '11 at 17:37
@michael, writing micro benchmarks ain't easy. There are many pitfalls; however, hotspot -server does quite decent loop unrolling, so that can be a reason. Post the entire code and I will rewrite the benchmark so you can test it in a fair way. — bestsss, Mar 11 '11 at 17:37
@michael, 100 loops might not ensure proper JIT unfortunately. — bestsss, Mar 11 '11 at 17:39
@bestsss If a minute or so on such a small piece of code doesn't cause a "proper JIT" then I think that's a problem. — Tom Hawtin - tackline, Mar 11 '11 at 17:43
@Tom, hotspot can, indeed, replace code on the stack, but cannot produce the best code while the method is still executing (not sure if it keeps the profiling code when replacing an active method). I might be wrong here, but for micro-benchmarks I always make sure to exit the method. — bestsss, Mar 11 '11 at 17:47
@bestsss: Does this still count as a micro benchmark? It takes at least 400ms. And 400 vs 800 is imho quite a big difference; even the normal Timer should have no problem measuring it. — michael, Mar 11 '11 at 17:50
@bestsss Not sure about that. It is (was) true that smaller methods can help HotSpot, presumably by removing extraneous rubbish that isn't being executed frequently. Beyond a certain size it just gives up, but that should be obvious. — Tom Hawtin - tackline, Mar 11 '11 at 17:51
I have tried the code, initializing an array of 25m integers with random numbers 1...1000, and on my slow laptop (let's say dual core 2.2 GHz, Core Duo 7500), in C# it takes less than 600ms (around 580ms) and it's very constant. — xanatos, Mar 11 '11 at 17:51
@michael, the 'micro' part of the benchmark is not related to the time spent by the machine but to the attempt to extract and test some portion of the code. The code here is as simple as it gets; it doesn't even have memory allocation or inheritance or other stuff that would require profiling by the JIT, though. Can you move the code starting w/ `for (h4 = h3 + 1; h4 < 50; h4++)` into a separate static method to see if there are going to be any differences? — bestsss, Mar 11 '11 at 17:54
Is it possible that you have a 64-bit machine, with Java running in 64-bit and .NET running 32-bit (the default now)? — Gabe, Mar 11 '11 at 18:01
@Tom, smaller methods do help the JIT, esp. for profiling in the beginning, and at some point (if lucky) they end up being inlined (which boosts the performance a ton). Methods that are not hot may just be left not properly compiled, or even interpreted. — bestsss, Mar 11 '11 at 18:02
I'll add that just "Unrolling" the calculation of (u5 + h7) gave me a boost of 10% (from 580 to 535). — xanatos, Mar 11 '11 at 18:07
@xanatos: thanks. I have now eliminated the sum of u5+h7 but it didn't change the performance. But unrolling 4 times was noticeable. It is now under 400ms in C#. I think I have to experiment more with this. — michael, Mar 11 '11 at 18:58
Are you sure that you are compiling for x64 and not x86 or Any CPU? That will make a big difference because you're doing computation with `long` in your inner loop. — Gabe, Mar 11 '11 at 19:05
@Michael test forcing both 32 and 64 bits. The differences can be noticeable. Very noticeable. Use Environment.Is64BitProcess to check. — xanatos, Mar 11 '11 at 19:17
The compilation is set to Any CPU but Environment.Is64BitProcess returns true. — michael, Mar 11 '11 at 21:41
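
The benchmarking advice from the comments above (run the loop several times, ignore the first few runs, and keep the hot loop in its own method so that it is entered and exited between measurements) could be sketched roughly as follows. This is only an illustration under stated assumptions: the real handRanksj table is not shown in the question, so the sketch uses a small stand-in table filled with random values in 1..1000, similar to what xanatos describes, and the names `WarmupBench`, `makeTable`, `enumerate` and the run counts are invented here.

```java
import java.util.Random;

// Hypothetical benchmark harness, not the asker's actual test code.
class WarmupBench {

    // Stand-in for handRanksj (assumption): values in 1..1000 keep every
    // index u + h below 1053, so a table of 2048 entries is large enough.
    static int[] makeTable(int size, long seed) {
        Random rnd = new Random(seed);
        int[] table = new int[size];
        for (int i = 0; i < size; i++) {
            table[i] = 1 + rnd.nextInt(1000);
        }
        return table;
    }

    // The enumeration from the question, moved into its own method.
    static long enumerate(int[] handRanks) {
        long sum = 0;
        for (int h1 = 1; h1 < 47; h1++) {
            int u0 = handRanks[53 + h1];
            for (int h2 = h1 + 1; h2 < 48; h2++) {
                int u1 = handRanks[u0 + h2];
                for (int h3 = h2 + 1; h3 < 49; h3++) {
                    int u2 = handRanks[u1 + h3];
                    for (int h4 = h3 + 1; h4 < 50; h4++) {
                        int u3 = handRanks[u2 + h4];
                        for (int h5 = h4 + 1; h5 < 51; h5++) {
                            int u4 = handRanks[u3 + h5];
                            for (int h6 = h5 + 1; h6 < 52; h6++) {
                                int u5 = handRanks[u4 + h6];
                                for (int h7 = h6 + 1; h7 < 53; h7++) {
                                    sum += handRanks[u5 + h7];
                                }
                            }
                        }
                    }
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] table = makeTable(2048, 42L);
        int warmupRuns = 3;    // discarded: may include JIT compilation time
        int measuredRuns = 5;  // only these count towards the result
        long bestNanos = Long.MAX_VALUE;
        long checksum = 0;
        for (int run = 0; run < warmupRuns + measuredRuns; run++) {
            long start = System.nanoTime();
            checksum = enumerate(table);
            long elapsed = System.nanoTime() - start;
            if (run >= warmupRuns) {
                bestNanos = Math.min(bestNanos, elapsed);
            }
        }
        // Printing the checksum keeps the computed result observable.
        System.out.println("checksum " + checksum + ", best " + bestNanos / 1e6 + " ms");
    }
}
```

Because the stand-in table is tiny and fits in cache, the absolute timings from this sketch say nothing about the 120 MB table in the question; it only illustrates the measurement structure.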

score 10 · Accepted Answer · answered Mar 11 '11 at 17:27

If you're timing a C# Debug build, or a Release build started from within Visual Studio with the debugger attached, you're going to get very misleading timings, because by default the JIT compiler does not optimize code that is launched under the debugger. Compile in Release mode and either run from the command line or run in Visual Studio without debugging. That is, rather than pressing F5 to run, press Ctrl+F5 to run without debugging.

Jim Mischel

In a C# console application that I use for batch processing, when I ran it in Release mode with F5, the initialization before reaching my task runner was significant: upwards of 40-50 seconds of waiting. Once I published it and ran the same file from disk, it dropped to < 1s. – Chris Marisic Mar 11 '11 at 17:31
Thanks, this was the main problem. Now the C# version is down to 480ms. It's still a bit slower but in the same area. – michael Mar 11 '11 at 18:29

score 1 · Answer 2 · answered Mar 11 '11 at 17:18

Is it possible that one of them is accessing sequential memory in the array (i.e., adjacent elements) in sequence while the other is bouncing around all over the place? If so, one of them will receive a serious boost from the processor prefetching adjacent array elements while the other will not.
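
To get a feel for the effect described here, the following sketch sums the same array once in sequential order and once in a shuffled order and times both. It is independent of the poker table in the question; the class name `AccessPatternSketch`, the array size and the run count are arbitrary choices for illustration, and the measured gap will vary by machine.

```java
import java.util.Random;

// Hypothetical micro-experiment, not taken from the question or answers.
class AccessPatternSketch {

    // Index order 0, 1, 2, ..., n - 1.
    static int[] sequentialOrder(int n) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        return order;
    }

    // The same indices in random order (Fisher-Yates shuffle).
    static int[] shuffledOrder(int n, long seed) {
        int[] order = sequentialOrder(n);
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }

    // Reads data[] in the given order. The order array itself is always
    // read sequentially, so only the access pattern on data[] differs.
    static long sumInOrder(int[] data, int[] order) {
        long sum = 0;
        for (int idx : order) {
            sum += data[idx];
        }
        return sum;
    }

    // Best time in milliseconds over several runs.
    static double bestMillis(int[] data, int[] order, int runs) {
        long best = Long.MAX_VALUE;
        long sink = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            sink += sumInOrder(data, order);
            best = Math.min(best, System.nanoTime() - start);
        }
        if (sink == 42) System.out.println(); // keep the result observable
        return best / 1e6;
    }

    public static void main(String[] args) {
        int n = 1 << 23; // about 8 million ints, 32 MB per array
        int[] data = sequentialOrder(n);
        System.out.println("sequential: " + bestMillis(data, sequentialOrder(n), 5) + " ms");
        System.out.println("shuffled:   " + bestMillis(data, shuffledOrder(n, 1L), 5) + " ms");
    }
}
```

Both variants compute the same sum, so any timing difference comes from the order in which memory is touched.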

That said, when writing a poker hand simulator you may want to try a Monte Carlo simulation instead. The results for the hand will converge long before you've tried all possible 7-card combinations.

If you use a deck-of-cards object, just fix the hands and the board at their current values, then deal out a random board from the deck, shuffle, and repeat x times. The values should converge on the actual probabilities long before you've enumerated every possibility.
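
The convergence claim can be illustrated with a small sketch. Since the post does not show a hand evaluator, this example swaps in a question whose exact answer is easy to compute: the probability that a random 7-card hand contains at least one ace. The card encoding (0..51, with rank = card % 13 and rank 12 meaning ace), the class name `MonteCarloSketch` and the sample count are all assumptions made for the illustration.

```java
import java.util.Random;

// Hypothetical illustration of Monte Carlo sampling over 7-card hands.
class MonteCarloSketch {

    // Estimates P(at least one ace among 7 random cards) by sampling.
    static double estimateAceProbability(int samples, long seed) {
        Random rnd = new Random(seed);
        int[] deck = new int[52];
        for (int i = 0; i < 52; i++) {
            deck[i] = i; // assumed encoding: rank = card % 13, 12 = ace
        }
        int hits = 0;
        for (int s = 0; s < samples; s++) {
            boolean hasAce = false;
            // Partial Fisher-Yates shuffle: only the first 7 positions
            // need to be random, which yields a uniform 7-card hand.
            for (int i = 0; i < 7; i++) {
                int j = i + rnd.nextInt(52 - i);
                int tmp = deck[i];
                deck[i] = deck[j];
                deck[j] = tmp;
                if (deck[i] % 13 == 12) {
                    hasAce = true;
                }
            }
            if (hasAce) {
                hits++;
            }
        }
        return (double) hits / samples;
    }

    // Exact value: 1 - C(48,7) / C(52,7), written as a running product.
    static double exactAceProbability() {
        double noAce = 1.0;
        for (int i = 0; i < 7; i++) {
            noAce *= (48.0 - i) / (52.0 - i);
        }
        return 1.0 - noAce;
    }

    public static void main(String[] args) {
        int samples = 200_000; // far fewer than the ~134 million combinations
        System.out.println("estimate: " + estimateAceProbability(samples, 1L));
        System.out.println("exact:    " + exactAceProbability());
    }
}
```

For a real equity calculation, the ace check would be replaced by a call to the hand evaluator, with the hole cards held fixed and only the remaining board cards sampled.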

It is allocated in both cases as a single array with the same content. I have another evaluator for Monte Carlo, but it evaluates not even 1/20 as many cards in the same time. — michael, Mar 11 '11 at 18:07
