
I don't have a background in C/C++ or related lower-level languages, so I'd never run into pointers before. I'm a game dev working primarily in C#, and this morning I finally decided to move to an unsafe context for some performance-critical sections of code (and please, no "don't use unsafe" answers, which I've read many times while doing research; it's already yielding around 6 times the performance in certain areas, with no issues so far, plus I love the ability to do things like reverse an array with no allocation). Anyhow, there's a certain situation where I expected no difference, or even a possible decrease in speed, and in reality I'm saving a lot of ticks (double the speed in some instances). This benefit seems to decrease with the number of iterations, which I don't fully understand.
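As an aside, the no-allocation reversal I mentioned could be sketched roughly like this (the method name is mine and this is just illustrative; it assumes an `int[]` and compiling with `/unsafe`):

```csharp
static class ArrayUtil
{
    // In-place reversal via two pointers walking inward from each end.
    // No temporary array is allocated; only one int is swapped at a time.
    public static unsafe void ReverseInPlace(int[] arr)
    {
        fixed (int* start = arr)   // pin the array so the GC can't move it
        {
            int* left = start;
            int* right = start + arr.Length - 1;
            while (left < right)
            {
                int tmp = *left;
                *left = *right;
                *right = tmp;
                left++;
                right--;
            }
        }
    }
}
```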

This is the situation:

int x = 0;
for(int i = 0; i < 100; i++)
    x++;

Takes about 15 ticks, on average.

EDIT: The following is unsafe code, though I assumed that was a given.

int x = 0, i = 0;
int* i_ptr;
for(i_ptr = &i; *i_ptr < 100; (*i_ptr)++)
    x++;

Takes about 7 ticks, on average.
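For completeness, a compilable sketch of what I'm timing looks roughly like this (the method names and the `Stopwatch` harness are mine, not the exact code I ran; compile with `/unsafe` and optimizations on):

```csharp
using System;
using System.Diagnostics;

static class LoopTiming
{
    // The plain loop: increment x via a normal local counter.
    static int DirectLoop(int count)
    {
        int x = 0;
        for (int i = 0; i < count; i++)
            x++;
        return x;
    }

    // The pointer loop: the counter is read and written through int*.
    static unsafe int PointerLoop(int count)
    {
        int x = 0, i = 0;
        int* i_ptr = &i;                       // address of a stack local
        for (*i_ptr = 0; *i_ptr < count; (*i_ptr)++)
            x++;
        return x;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        int a = DirectLoop(100);
        sw.Stop();
        Console.WriteLine($"direct:  {sw.ElapsedTicks} ticks (x = {a})");

        sw.Restart();
        int b = PointerLoop(100);
        sw.Stop();
        Console.WriteLine($"pointer: {sw.ElapsedTicks} ticks (x = {b})");
    }
}
```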

As I mentioned, I don't have a low-level background, and I literally just started using pointers this morning, at least directly, so I'm probably missing quite a bit of info. So my first query is: why is the pointer more performant in this case? It isn't an isolated instance, and of course there are a lot of other variables at play on the PC at that specific point in time, but I'm getting these results very consistently across a lot of tests.

In my head, the operations are as such:

No pointer:

  • Get address of i
  • Get value at address

Pointer:

  • Get address of i_ptr
  • Get address of i from i_ptr
  • Get value at address

In my head, there must surely be more overhead, however ridiculously negligible, from using a pointer here. How is it that the pointer is consistently more performant than the direct variable in this case? These are all on the stack as well, of course, so from what I can tell it's not dependent on where they end up being stored.

As touched on earlier, the caveat is that this bonus decreases with the number of iterations, and pretty fast. I took out the extremes from the following data to account for background interference.

At 1000 iterations, they are both identical at 30 to 34 ticks.

At 10000 iterations, the pointer is slower by about 20 ticks.

Jump up to 10000000 iterations, and the pointer is slower by about 10000 ticks or so.

My assumption is that the decrease comes from the extra step I covered earlier, given that there is an additional lookup, which brings me back to wondering why it's more performant with a pointer than without at low loop counts. At the very least, I'd assume they would be more or less identical (which they are in practice, I suppose, but a difference of 8 ticks across millions of repeated tests is pretty definitive to me) up until the rough threshold I found somewhere between 100 and 1000 iterations.

Apologies if I'm nitpicking somewhat, or if this is a poor question, but I feel as though it will be beneficial to know exactly what is going on under the hood. And if nothing else, I think it's pretty interesting!

  • Good chances are, your benchmark is not set up correctly to capture the number of ticks. A combination of compiler and hardware optimizations make writing such micro-benchmarks very hard. – Sergey Kalinichenko Feb 28 '17 at 16:24
  • First of all, can you at least post code that compiles? The second batch needs to be in an `unsafe` context, and `i` doesn't exist. Also, are you running with compiler optimisations turned on? The differences can be rather significant. – DavidG Feb 28 '17 at 16:25
  • What you write is not what you get, abstract operations are not what is executed (for example, in the first case there will definitely be no address of `i` to get, it'll be in a register). Compare the asm. – harold Feb 28 '17 at 16:26
  • @harold This is C# code, there is no asm – DavidG Feb 28 '17 at 16:27
  • @DavidG sorry, just wrote everything up quickly. And yep, compiler optimisation is active. – Josh Alexander Feb 28 '17 at 16:30
  • @DavidG there is always asm. How else is it executing. Disable "suppress optimizations", hit a breakpoint and switch to disassembly view, it's easy. – harold Feb 28 '17 at 16:33
  • Of course there is assembly code; the computer is running it somehow! Take a look at what code is actually generated. Absent that information, my guess is the same as the others; this is probably an artifact of your testing regimen. You may, for instance, be actually measuring the difference in the time taken to jit-compile the code, not to run the code. If that's not it, then it is probably small differences in register allocation. – Eric Lippert Feb 28 '17 at 16:35
  • @harold OK, yes, at the lowest level there will be of course, but it's much easier to look at the generated IL. – DavidG Feb 28 '17 at 16:36
  • Incidentally, I find your assertion that "now I can reverse an array without an allocation" to be somewhat bizarre; what allocation are you allocating when reversing an array in-place without using pointers? – Eric Lippert Feb 28 '17 at 16:36
  • @EricLippert maybe it is `array = array.Reverse().ToList().ToArray();` - at least 3 new arrays :) – Alexei Levenkov Feb 28 '17 at 16:38
  • @DavidG easier yes, but a lot less useful since almost all optimization is done by the JIT compiler – harold Feb 28 '17 at 16:39
  • @harold You may be right there. I guess I'm just jaded by having to write a flight sim in asm many years ago as part of my uni project! Natural reaction now is to run away :) – DavidG Feb 28 '17 at 16:40
  • Good point @EricLippert, kind of a weird misstep on my part, I think it just came from what I'd seen staff members doing where they would allocate a new array of the same length and assign in the reverse order, could be done with a single byte without pointers of course, came as a nice test for me though (around 7 times faster than Array::Reverse). But yep, a strange assertion, I'm tired this evening. – Josh Alexander Feb 28 '17 at 16:42
  • Benchmarking very fast code like this accurately is quite hard to do. In fact the pointer version is ~4 times slower in optimized Release built code. There is a [benchmarking package](https://www.nuget.org/packages/BenchmarkDotNet/) available that is likely to help you measure it correctly. – Hans Passant Feb 28 '17 at 16:52
  • Thanks @HansPassant, seems to be pretty conclusive that the main cause is most likely measurement inaccuracy, I'll check it out. – Josh Alexander Feb 28 '17 at 17:07

1 Answer


Some users suggested that the test results were most likely due to measurement inaccuracy, and that would seem to be the case, at least up to a point. When averaged across ten million continuous tests, the means of the two are typically equal, though in some cases the pointer version averages out to an extra tick. Interestingly, when testing as a single case, the pointer version has a consistently lower execution time. There are of course a lot of additional variables at play at the specific points in time at which a test is run, which makes it somewhat of a pointless pursuit to track this down any further. But the result is that I've learned some more about pointers, which was my primary goal, so I'm pleased with the test.
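For anyone landing here later, the benchmarking package Hans Passant linked handles warm-up, JIT compilation, and statistical noise for you. A sketch of how the comparison might be set up with it (class and method names are mine; requires the BenchmarkDotNet NuGet package and `<AllowUnsafeBlocks>` enabled):

```csharp
using BenchmarkDotNet.Attributes;

public unsafe class IncrementLoops
{
    // Run each benchmark at several iteration counts, matching the
    // 100 / 1000 / 10000 cases discussed in the question.
    [Params(100, 1000, 10000)]
    public int N;

    [Benchmark(Baseline = true)]
    public int Direct()
    {
        int x = 0;
        for (int i = 0; i < N; i++)
            x++;
        return x;   // returning the result discourages the JIT from eliding the loop
    }

    [Benchmark]
    public int ViaPointer()
    {
        int x = 0, i = 0;
        int* i_ptr = &i;
        for (*i_ptr = 0; *i_ptr < N; (*i_ptr)++)
            x++;
        return x;
    }
}

// Then, in Main:
// BenchmarkDotNet.Running.BenchmarkRunner.Run<IncrementLoops>();
```

Both methods compute the same result; BenchmarkDotNet reports the mean time and error for each, so any real difference between the two loops should survive the noise.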