-1

I have a micro-optimization question. I have three ways of walking an array through a typed pointer. Which one is better?

1

for I := 0 to ArrCount - 1 do
begin  // I is not used in the block below
  Inc(P);  // P is a typed pointer
  // do something
end;

2

for I := ArrCount - 1 downto 0 do
begin  // I is not used in the block below
  Inc(P);  // P is a typed pointer
  // do something
end;

3

while ArrCount > 0 do
begin
  Inc(P);  // P is a typed pointer
  // do something
  Dec(ArrCount);
end;
MajidTaheri
  • Measure it. If you can't detect the difference, then it doesn't matter. – gabr Aug 20 '15 at 07:29
  • It's far more likely that the "do something" will hurt more than the loop. In fact, optimizing that could result in you choosing a different type of loop. You are better off telling us what you would like to accomplish, because using something like SIMD would do away with the loop altogether. – Graymatter Aug 20 '15 at 08:27
  • You can measure it yourself using something like `TStopWatch`. – Craig Aug 20 '15 at 10:30
  • None of those is better than the other without considering what is actually happening in the loop. Benchmark the options using your actual code and compare the results. You're saying "Which is faster, a Ferrari or a Ford F350?", and the answer is "It depends what they're doing. For laps on a race track, the Ferrari is faster. For hauling two tons of gravel, the Ford is faster." - measure the task being done, and then decide on a solution. – Ken White Aug 20 '15 at 12:35
  • You probably have a fourth way that doesn't use a loop: a recursive function. – Abstract type Aug 21 '15 at 08:29

2 Answers

5

The answer that I will give to this question is rather more mundane than perhaps you are expecting. The fastest of these variants is the one that, wait for it, is timed to run most quickly.

It's entirely plausible that on different architectures you'll find that different variants win.

It's also conceivable that different variants will win depending on what is in the body of the loop.

It's also quite possible that the body of the loop takes sufficient time that the loop itself is negligible in comparison.

In short, it depends. Since only you know what happens inside the body, only you can answer the specific question.
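A minimal timing sketch along these lines, assuming a Delphi version that ships `TStopwatch` in `System.Diagnostics`. The summing body, the helper names (`SumForTo`, `SumForDownto`, `SumWhile`, `Measure`) and `ElementCount` are hypothetical stand-ins for the real loop body and data; swap in your own code before drawing any conclusion.

```pascal
program LoopTiming;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

type
  TSumFunc = function(P: PInteger; Count: Integer): Int64;

// Variant 1: ascending for loop; I is not used in the body.
function SumForTo(P: PInteger; Count: Integer): Int64;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to Count - 1 do
  begin
    Inc(Result, P^); // stand-in for the real loop body
    Inc(P);
  end;
end;

// Variant 2: descending for loop; I is not used in the body.
function SumForDownto(P: PInteger; Count: Integer): Int64;
var
  I: Integer;
begin
  Result := 0;
  for I := Count - 1 downto 0 do
  begin
    Inc(Result, P^);
    Inc(P);
  end;
end;

// Variant 3: while loop counting the remaining elements down.
function SumWhile(P: PInteger; Count: Integer): Int64;
begin
  Result := 0;
  while Count > 0 do
  begin
    Inc(Result, P^);
    Inc(P);
    Dec(Count);
  end;
end;

// Times one call of F and prints the sum so the work cannot be skipped.
procedure Measure(const Name: string; F: TSumFunc; P: PInteger; Count: Integer);
var
  SW: TStopwatch;
  Total: Int64;
begin
  SW := TStopwatch.StartNew;
  Total := F(P, Count);
  SW.Stop;
  Writeln(Format('%-12s sum=%d  %d ms', [Name, Total, SW.ElapsedMilliseconds]));
end;

const
  ElementCount = 10000000; // hypothetical size; pick one that matches your data

var
  Data: TArray<Integer>;
  I: Integer;
begin
  SetLength(Data, ElementCount);
  for I := 0 to High(Data) do
    Data[I] := I and 7;

  Measure('for..to', SumForTo, @Data[0], ElementCount);
  Measure('for..downto', SumForDownto, @Data[0], ElementCount);
  Measure('while', SumWhile, @Data[0], ElementCount);
end.
```

Run it several times in a release build with optimization on, and compare the variants only after the first pass has warmed the cache. A single run is not evidence.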

As an aside, if the loop body does not refer to the loop variable, then the compiler re-writes the ascending loop as if it were a descending loop. So there may in fact be only two variants here. Indeed, that might mean that all three variants lead to identical compiled code!

Some advice:

  • Never optimise without profiling.
  • Never optimise code that is not a bottleneck.

Now, if you want me to take a guess, I predict that for any loop body that is more than a trivial nop, you'll find it hard to find any measurable difference between these variants.

I also see that you are using a pointer to walk across an array. You might find that if this code is a bottleneck, and if the loop body just handles this array iteration, that using arr[] indexing is more effective than pointer arithmetic. But again, it depends on many things and you have to profile, and look at the code the compiler produces.
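For illustration, here is the same traversal written both ways, as a sketch only. `SumByPointer` and `SumByIndex` are hypothetical names and the summing body is a placeholder; the program only checks that the two forms agree, and which one is faster still has to be measured on your code.

```pascal
program IndexVsPointer;

{$APPTYPE CONSOLE}

// Walks the array with a typed pointer, as in the question.
function SumByPointer(const A: TArray<Integer>): Int64;
var
  P: PInteger;
  N: Integer;
begin
  Result := 0;
  N := Length(A);
  if N = 0 then
    Exit; // nothing to point at in an empty array
  P := @A[0];
  while N > 0 do
  begin
    Inc(Result, P^);
    Inc(P);
    Dec(N);
  end;
end;

// Walks the same array with plain arr[] indexing.
function SumByIndex(const A: TArray<Integer>): Int64;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(A) do
    Inc(Result, A[I]);
end;

var
  Data: TArray<Integer>;
  I: Integer;
begin
  SetLength(Data, 1000);
  for I := 0 to High(Data) do
    Data[I] := I;

  // Both traversals must produce the same result before timing them.
  if SumByPointer(Data) = SumByIndex(Data) then
    Writeln('Results match')
  else
    Writeln('Results differ');
end.
```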

David Heffernan
  • This code is used in my most important functions (which are called very many times). – MajidTaheri Aug 20 '15 at 07:51
  • @David: AFAIK, even if it refers to the loop variable, but the order of execution is totally unimportant (e.g. in a loop filling an array with the values of the loop variable, `a[i] := i;`), the descending loop form is chosen. Only if the order makes a difference in the outcome of the program is the loop left unmodified. – Rudy Velthuis Aug 20 '15 at 21:30
  • @Rudy That would take quite a bit of analysis from the compiler. I'd be surprised if it could manage that. – David Heffernan Aug 20 '15 at 21:56
  • Hmmm... I'm flabbergasted. I could have sworn I have seen this in (OK, admittedly simple) situations where the for-loop variable was used and it still ran "backward" (when optimization was on). But now I can't reproduce it. I was so sure. – Rudy Velthuis Aug 20 '15 at 22:09
-1

Funnily enough, judging by the disassembly window, the speed depends on whether the loop variable is used inside the loop.

1) Not used: the code is almost identical:

Project17.dpr.12: for i := 0 to 3 do
0040914D B804000000       mov eax,$00000004
Project17.dpr.13: Inc(j);
00409152 43               inc ebx
Project17.dpr.12: for i := 0 to 3 do
00409153 48               dec eax
00409154 75FC             jnz $00409152

Project17.dpr.15: for i := 3 downto 0 do
00409156 B8FCFFFFFF       mov eax,$fffffffc
Project17.dpr.16: Inc(j);
0040915B 43               inc ebx
Project17.dpr.15: for i := 3 downto 0 do
0040915C 40               inc eax
0040915D 75FC             jnz $0040915b

2) Used: the first variant is slightly faster because xor is faster than mov:

Project17.dpr.12: for i := 0 to 3 do
0040914D 33C0             xor eax,eax
Project17.dpr.13: Inc(j, i);
0040914F 03D8             add ebx,eax
00409151 40               inc eax
Project17.dpr.12: for i := 0 to 3 do
00409152 83F804           cmp eax,$04
00409155 75F8             jnz $0040914f

Project17.dpr.15: for i := 3 downto 0 do
00409157 B803000000       mov eax,$00000003
Project17.dpr.16: Inc(j, i);
0040915C 03D8             add ebx,eax
0040915E 48               dec eax
Project17.dpr.15: for i := 3 downto 0 do
0040915F 83F8FF           cmp eax,-$01
00409162 75F8             jnz $0040915c

You can check the third variant yourself.

PS: I am using D2007 for this test.

Abelisto
  • I can smell premature optimisation, and conclusions drawn without profiling. I also wonder how you determined that calling `xor` rather than `mov` once will make a measurable difference. Do you have any timings or evidence to back that up? This feels like we've awarded the gold medal to Usain Bolt already without actually running the race. – David Heffernan Aug 20 '15 at 08:10
  • @DavidHeffernan Yes, this brief investigation is more academic than practical. But in any case it is useful to know what happens "behind the scenes". – Abelisto Aug 20 '15 at 08:16
  • The problem I have with this is that it leads the asker into believing that performance is easily predicted simply by looking at code. For sure it's important to study the code that the compiler produces. But one has to time the code also. Without timing, wrong conclusions will be drawn. – David Heffernan Aug 20 '15 at 08:20
  • @DavidHeffernan On old CPUs, at least the 8088 through the 80386, `xor Reg, Reg` was definitely faster than `mov Reg, 0`. I believe Borland emits `xor` in this case for a reason :) – Abelisto Aug 20 '15 at 08:27
  • Firstly, how the counter is initialized (`mov` or `xor`) cannot have much effect. Secondly, the comparison matters more: the comparison in #2 is faster than the one in #1 because it compares against a direct value (zero). Also, `Dec` is a bit faster than `Inc`. – MajidTaheri Aug 20 '15 at 11:38
  • The performance of extremely old CPUs that are most likely not in use any longer is not relevant. Neither is the ASM generated by Turbo Pascal 1.0. Times change, and using decades-old details to try and support your position is a waste of space and time. – Ken White Aug 20 '15 at 12:32
  • @Majid It looks like you ignored all the advice in response to your question and homed in on xor and dec. – David Heffernan Aug 20 '15 at 13:11
  • @DavidHeffernan All your advice is reasonable, but my functions are not simple and I cannot share them. Also, I use `ProDelphi` for profiling, and I found this code to be a bottleneck because the same code executes a million times. – MajidTaheri Aug 20 '15 at 13:36
  • @MajidTaheri You don't seem to understand what we've been trying to explain to you. I'm going to give up. It's not fun for me anymore. – David Heffernan Aug 20 '15 at 13:44
  • @Majid, consider inlining functions that are called so many times. – TLama Aug 20 '15 at 16:15
  • @TLama That can make things worse too. – David Heffernan Aug 20 '15 at 19:32
  • @David, I can't imagine why. Could you briefly elaborate (if possible), please? – TLama Aug 20 '15 at 19:37
  • @TLama If the function is long, then you can increase the code size. That then reduces the amount of cache available for data and increases the rate of cache misses. Only small functions should be inlined. – David Heffernan Aug 20 '15 at 20:15
  • @David, thanks, I didn't think about that! Good to know. – TLama Aug 20 '15 at 20:44