5

A friend sent me a comparison between a recent version of Delphi and Java (source code available if you want it). Believe it or not, Java is now significantly faster than Delphi, because the Delphi compiler won't take advantage of modern CPU instructions! A big breakthrough for the 'slow' Java.

My question is: How can we use modern CPU instructions in Delphi WITHOUT resorting to ASM?

The FastCode project was a partial answer to the above question, but it is now abandoned. Is there any other project similar to FastCode?

Here is another article showing that Java and C# are indeed MUCH faster than Delphi: http://webandlife.blogspot.com/2011/12/c-performance-vs-delphi-performance.html


JAVA

import java.util.Date;

public class j
{
  public static void xxx(int n, int m)
  {
        double t;
        int i, j;
        double d, r;
        t = 0.0;
        for (j = 1; j <= n; j++)
        {
          t = t / 1000.0;
          for (i = 1; i <= m; i++)
          {
                t = t + i / 999999.0;
                d = t * t + i;
                r = (t + d) / (200000.0 * (i + 1));
                t = t - r;
          }
        }
        System.out.println(t);
  }

  public static void main(String [] args)
  {
        Date t1, t2;

        t1 = new Date();
        xxx(1, 999999999);
        t2 = new Date();
        System.out.println((t2.getTime() - t1.getTime())/1000);
        t1 = new Date();
        xxx(1, 999999999);
        t2 = new Date();
        System.out.println((t2.getTime() - t1.getTime())/1000);
  }
}

25 sec

DELPHI

program d;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.DateUtils;
var
  t1, t2: TDateTime;

procedure xxx (n: integer; m: integer);
var
  t: double;
  i, j: integer;
  d, r: double;
begin
  t:= 0.0;
  for j:= 1 to n do
  begin
        t:= t / 1000.0;
        for i:= 1 to m do
        begin
          t:= t + i / 999999.0;
          d:= t * t + i;
          r:= (t + d) / (200000.0 * (i + 1));
          t:= t - r;
        end;
  end;
  writeln(t);
end;

begin
  t1:= Now;
  xxx(1, 999999999);
  t2:= Now;
  writeln(SecondsBetween(t2,t1));

  t1:= Now;
  xxx(1, 999999999);
  t2:= Now;
  writeln(SecondsBetween(t2,t1));
end.

37 sec


It seems Delphi is still at the bottom of the food chain: http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

I wonder how Lazarus compares with Delphi from this point of view.

Gabriel
  • 20,797
  • 27
  • 159
  • 293
  • 2
    Yes. It is sad indeed. – thelight Sep 04 '13 at 09:44
  • Java is not significantly faster than Delphi. – David Heffernan Sep 04 '13 at 09:52
  • I don't quite understand what you are asking here. You have two questions. One in the question title, and one in the second paragraph. Which one do you want an answer for? – David Heffernan Sep 04 '13 at 10:14
  • 4
    The only "clarification" I see is "without using ASM". And who is "They"? I believe a game from 1997 runs with way more than 5 times your Vsync on a modern machine if it was compiled as a "real" C or delphi program. Also, real time rendering benchmarks don't tell anything about compiler performance since most of those benchmarks run on the GPU anyway. What are "modern" instructions? Do you mean stuff like SSE or MMX? That would certainly be interesting and be closer to the "real world" instead of just dividing some random numbers. – Günther the Beautiful Sep 04 '13 at 11:00
  • 2
    The accepted answer is directly referring to code - which apparently has been removed from the question, so now the answer doesn't quite fit. – Jerry Dodge Sep 04 '13 at 12:13
  • 1
    @Altar It's too late to remove the code. Two of your answers refer to it. – David Heffernan Sep 04 '13 at 13:13
  • 2
    @DavidHeffernan In 2011 Delphi got demolished by both Java and C# in the Scimark benchmark. There was a big dust-up on the Internet about it, but it was the same cause: lack of modern instruction set usage. – alcalde Sep 04 '13 at 18:18
  • @alcalde Explain why there's so little difference between x86 and x64 versions of my program? Benchmarking is hard. – David Heffernan Sep 04 '13 at 18:32
  • By adding a double for the index i, the x64 compiler went from 21 to 13 seconds, while the x32 compiler went from 22 to 23 seconds. D10Seattle. `procedure xxx (n: integer; m: integer); var t: double; i, j: integer; d, r, ii: double; begin t:= 0.0; for j:= 1 to n do begin t:= t / 1000.0; ii := 1.0; for i:= 1 to m do begin t:= t + ii / 999999.0; d:= t * t + ii; ii := ii + 1.0; r:= (t + d) / (200000.0 * ii); t:= t - r; end; end; WriteLn(t); end;` Conclusion, knowing how to help the compiler optimize the code can make a difference. – LU RD Sep 09 '15 at 21:16
  • 1
    @LURD - This is very interesting. But how did you get here? Is this 'optimization' documented in Delphi's help? – Gabriel Sep 14 '15 at 18:52
  • The x64 compiler can potentially make more efficient use of registers (more CPU and SSE registers). In this case the integer variable i is used three times in the calculation loop. As can be seen from one of the answers, it is poorly used by both the x32 and the x64 compiler. By changing the algorithm a little (using a float variable instead of an integer), the x64 compiler was able to improve the speed significantly. It boils down to knowing how the CPU is equipped. Anyway, mixing integers and floats in calculations should be avoided if possible when performance is an issue. – LU RD Sep 14 '15 at 19:48
  • 1
    @LURD - Yes, but this is a bit like cheating... We are comparing here a 'standard' Java program with a 'standard' Delphi program. I guess most Delphi users will use an integer in the FOR loop because this is what the Delphi manual shows in its examples. They NEVER use a real in their loops. Following the 'direction' you gave, one could say that a Delphi program can be improved by writing parts of it in ASM :) :) I am not saying that your coding is wrong, only that it is not 'standard', or that it is not the 'normal' way of Delphi programming. The Java program I show uses the 'standard' integer for loops. – Gabriel Sep 15 '15 at 07:29
  • I don't consider it cheating. You can compare the updated algorithm with a similar in Java if you like. When comparing different compilers and languages, you should always use what suits them best. – LU RD Sep 15 '15 at 08:47
  • I would even say that the way I rewrote the algorithm is the canonical way to write this algorithm (if you want it to be computed by a computer, not just an excerpt from a school book). – LU RD Sep 15 '15 at 09:00
  • 1
    Just ran a test on my 64 bit computer. The Java original program 21 seconds, Java modified 15 seconds. And Delphi x64 went from 21 seconds to 13 seconds. So Delphi wins this test. – LU RD Sep 15 '15 at 12:47
  • @LURD - This is great news. Upvote. Can you please post both programs as an answer, so people can vote on it and (possibly) so I can accept it? – Gabriel Sep 17 '15 at 09:11
  • PS: I put a question mark after 'Java faster than Delphi' until you post the programs :) :) – Gabriel Sep 17 '15 at 09:12

5 Answers

12

According to your code, what is slow with the 32-bit Delphi compiler is the floating point arithmetic support, which is far from optimized and copies a lot of content to and from the FPU stack.

With respect to floating point arithmetic, it is not only JITted Java code that will be faster. Even modern JavaScript JIT compilers can do much better than Delphi!

This blog article is just one reference about this, and provides an asm-level explanation of Delphi's slowness for floating point.


But if you use the Delphi compiler targeting the Win64 platform, it will emit SSE2 opcodes instead of x87, and will be much faster; I suspect comparable to the JITted Java executable.

And, compared to Java, any Delphi executable will use much less memory than the JVM, so on that front Delphi executables are perfectly on track!

If you want your code to be faster, do not use asm or low-level optimization tricks, but change your algorithm. That can be orders of magnitude faster than any compiler-level tweak. Dedicated speed-ups can be achieved with inlined asm opcodes - take a look at this great set of articles for such low-level hacks. But that is not easy to master, and usually proper software profiling and adding some caching is the best way to performance!
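
The answer does not include code for the caching idea, but as a rough sketch (the program name, function names and the dummy calculation below are mine, not from the answer), memoizing an expensive computation in Delphi can be as simple as:

program MemoDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Generics.Collections;

var
  Cache: TDictionary<Integer, Double>;

function ExpensiveCalc(X: Integer): Double;
begin
  // stand-in for a costly computation
  Result := Sqrt(X) * Ln(X + 1.0);
end;

function CachedCalc(X: Integer): Double;
begin
  // return the cached value if we already computed it for this input
  if not Cache.TryGetValue(X, Result) then
  begin
    Result := ExpensiveCalc(X);
    Cache.Add(X, Result);
  end;
end;

begin
  Cache := TDictionary<Integer, Double>.Create;
  try
    Writeln(CachedCalc(42):0:6);  // computed
    Writeln(CachedCalc(42):0:6);  // served from the cache
  finally
    Cache.Free;
  end;
end.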

Arnaud Bouchez
  • 42,305
  • 3
  • 71
  • 159
  • 1
    In fact the code emitted by the Delphi x64 compiler for this example is no faster than that emitted by the x86 compiler. – David Heffernan Sep 04 '13 at 11:15
  • 1
    I disagree that there is no place for ASM optimization. Compilers can't know everything, and especially for compact elementary operations there are a lot of things they can't know. I have a custom ASM routine, for example, that simultaneously computes a complex number subtraction followed by the complex square root of the result - one for x87 and one for SSE. Both benefit from re-use of intermediates, optimized stack manipulation (x87) or simultaneous calculation (SSE), and conditional branching (on intermediate signs or zeroes, etc). No compiled algorithm has scope to do this. – J... Sep 04 '13 at 11:47
  • The benchmark you refer to is probably not representative. My program is dominated by floating point calcs. It does pretty much nothing else. Difference between x86 and x64 versions typically is 5%. SSE is no magic cure all. – David Heffernan Sep 04 '13 at 12:17
  • 1
  • @J... I did not write that there is no place for ASM optimization: I wrote "Dedicated speed-ups can be achieved with inlined asm opcodes" and only stated that this is not easy. ASM should be only the last resort, when your code is proven and stable, and you have the time and the need for a dedicated version. Do not forget also that you should better have a "pure pascal" version in addition to the tuned ASM version, if you want to be truly cross-platform. This is what we did for our code (e.g. for JSON parsing or generation). – Arnaud Bouchez Sep 04 '13 at 12:45
  • 1
  • @DavidHeffernan If you know how to minimize the x87 stack use, you can achieve good speed with FPU code, of course. But sometimes the Delphi 32-bit compiler just produces awfully slow code. [Look at this, and weep](http://delphitools.info/2011/02/08/the-limitations-of-delphis-inline/). For non floating-point code, Win64 will usually be slower as a target platform, since the RTL is much less optimized, and some emitted code is not as good as with the 32-bit backend (e.g. "case"). And the pointer size increase can also be a problem of CPU cache misses. – Arnaud Bouchez Sep 04 '13 at 12:47
  • @DavidHeffernan Agreed (re: SSE no magic cure) - I've also noticed the same thing. Anywhere I've found SSE to have been really useful was specifically where the problem could be vectorized to some degree. For straight linear problems it's nothing to write home about. – J... Sep 04 '13 at 13:12
  • @J... That's right. It seems unlikely that anyone is going to invent a fancy new way to perform scalar arithmetic quicker than what has already been discovered. – David Heffernan Sep 04 '13 at 13:14
  • Don't be misled by such "micro-benchmarks". The efficiency of a platform is not just about compiler code generation. You have to compare not a pure mathematical loop, but realistic scenarios. If you want to compute fast math, you would probably use optimized C/C++, perhaps with OpenCL. But with real programs, Java could be less efficient than Delphi: slower start time, using much more (MUCH more!) memory, noticeable lag when the GC starts, having to use StringBuilder for strings.... See e.g. https://robertocschneiders.wordpress.com/2013/01/09/datasnap-analysis-based-on-speed-stability-tests-part-2 – Arnaud Bouchez Oct 15 '15 at 15:27
10

To follow on from Arnaud's point, I actually compiled this in Delphi for x86 and x64.

32-bit compiler:

Unit1.pas.36: t:= t / 1000.0;
0051274D DD45F0           fld qword ptr [ebp-$10]
00512750 D835E4275100     fdiv dword ptr [$005127e4]
00512756 DD5DF0           fstp qword ptr [ebp-$10]
00512759 9B               wait 
Unit1.pas.37: for i:= 1 to m do
0051275A 8B45F8           mov eax,[ebp-$08]
0051275D 85C0             test eax,eax
0051275F 7E57             jle $005127b8
00512761 8945D0           mov [ebp-$30],eax
00512764 C745EC01000000   mov [ebp-$14],$00000001
Unit1.pas.39: t:= t + i / 999999.0;
0051276B DB45EC           fild dword ptr [ebp-$14]
0051276E D835E8275100     fdiv dword ptr [$005127e8]
00512774 DC45F0           fadd qword ptr [ebp-$10]
00512777 DD5DF0           fstp qword ptr [ebp-$10]
0051277A 9B               wait 
Unit1.pas.40: d:= t * t + i;
0051277B DD45F0           fld qword ptr [ebp-$10]
0051277E DC4DF0           fmul qword ptr [ebp-$10]
00512781 DB45EC           fild dword ptr [ebp-$14]
00512784 DEC1             faddp st(1)
00512786 DD5DE0           fstp qword ptr [ebp-$20]
00512789 9B               wait 
Unit1.pas.41: r:= (t + d) / (200000.0 * (i + 1));
0051278A DD45F0           fld qword ptr [ebp-$10]
0051278D DC45E0           fadd qword ptr [ebp-$20]
00512790 8B45EC           mov eax,[ebp-$14]
00512793 40               inc eax
00512794 8945CC           mov [ebp-$34],eax
00512797 DB45CC           fild dword ptr [ebp-$34]
0051279A D80DEC275100     fmul dword ptr [$005127ec]
005127A0 DEF9             fdivp st(1)
005127A2 DD5DD8           fstp qword ptr [ebp-$28]
005127A5 9B               wait 
Unit1.pas.42: t:= t - r;
005127A6 DD45F0           fld qword ptr [ebp-$10]
005127A9 DC65D8           fsub qword ptr [ebp-$28]
005127AC DD5DF0           fstp qword ptr [ebp-$10]
005127AF 9B               wait 
Unit1.pas.43: end;
005127B0 FF45EC           inc dword ptr [ebp-$14]
Unit1.pas.37: for i:= 1 to m do
005127B3 FF4DD0           dec dword ptr [ebp-$30]
005127B6 75B3             jnz $0051276b
Unit1.pas.44: end;
005127B8 FF45E8           inc dword ptr [ebp-$18]

64-bit compiler:

Unit1.pas.36: t:= t / 1000.0;
000000000059F94E F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F953 F20F5E05BD000000 divsd xmm0,qword ptr [rel $000000bd]
000000000059F95B F20F114548       movsd qword ptr [rbp+$48],xmm0
000000000059F960 C7C001000000     mov eax,$00000001
000000000059F966 8B5568           mov edx,[rbp+$68]
000000000059F969 894544           mov [rbp+$44],eax
000000000059F96C 395544           cmp [rbp+$44],edx
000000000059F96F 7F73             jnle xxx + $C4
000000000059F971 83C201           add edx,$01
Unit1.pas.39: t:= t + i / 999999.0;
000000000059F974 F20F2A4544       cvtsi2sd xmm0,dword ptr [rbp+$44]
000000000059F979 F20F5E059F000000 divsd xmm0,qword ptr [rel $0000009f]
000000000059F981 F20F104D48       movsd xmm1,qword ptr [rbp+$48]
000000000059F986 F20F58C8         addsd xmm1,xmm0
000000000059F98A F20F114D48       movsd qword ptr [rbp+$48],xmm1
Unit1.pas.40: d:= t * t + i;
000000000059F98F F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F994 F20F594548       mulsd xmm0,qword ptr [rbp+$48]
000000000059F999 F20F2A4D44       cvtsi2sd xmm1,dword ptr [rbp+$44]
000000000059F99E F20F58C1         addsd xmm0,xmm1
000000000059F9A2 F20F114538       movsd qword ptr [rbp+$38],xmm0
Unit1.pas.41: r:= (t + d) / (200000.0 * (i + 1));
000000000059F9A7 F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F9AC F20F584538       addsd xmm0,qword ptr [rbp+$38]
000000000059F9B1 8B4544           mov eax,[rbp+$44]
000000000059F9B4 83C001           add eax,$01
000000000059F9B7 F20F2AC8         cvtsi2sd xmm1,eax
000000000059F9BB F20F590D65000000 mulsd xmm1,qword ptr [rel $00000065]
000000000059F9C3 F20F5EC1         divsd xmm0,xmm1
000000000059F9C7 F20F114530       movsd qword ptr [rbp+$30],xmm0
Unit1.pas.42: t:= t - r;
000000000059F9CC F20F104548       movsd xmm0,qword ptr [rbp+$48]
000000000059F9D1 F20F5C4530       subsd xmm0,qword ptr [rbp+$30]
000000000059F9D6 F20F114548       movsd qword ptr [rbp+$48],xmm0
Unit1.pas.43: end;
000000000059F9DB 83454401         add dword ptr [rbp+$44],$01
000000000059F9DF 395544           cmp [rbp+$44],edx
000000000059F9E2 7590             jnz xxx + $54
000000000059F9E4 90               nop
Unit1.pas.44: end;
000000000059F9E5 83454001         add dword ptr [rbp+$40],$01
000000000059F9E9 394D40           cmp [rbp+$40],ecx
000000000059F9EC 0F855CFFFFFF     jnz xxx + $2E
000000000059F9F2 90               nop
Unit1.pas.45: writeln(t);
000000000059F9F3 488B0D9E150300   mov rcx,[rel $0003159e]

Oddly, in this case, the x87 FPU code was actually about 5% faster. The conclusion is probably nothing more than the fact that Delphi's 32-bit/x87 compiler is extremely mature and rather well optimized, while the 64-bit compiler still has some room to refine its performance. I can easily see a few places where the SSE code could be optimized here: i, for example, could be stored in an XMM register and re-used rather than being re-converted each time with cvtsi2sd, d could be held in an XMM register for the next calculation rather than being stored and re-loaded, etc.

Unaligned MOVs in and out of XMM registers can actually be surprisingly expensive. The actual SSE calculations are faster, but the excessive data movement is probably evening out the score. Maybe Java forces 16-byte alignment on the stack? I know macOS does this, and there are definite benefits for SSE in using aligned rather than unaligned moves (at the expense of consuming more stack space, of course).

For example

  • fild : 1 op, 9 latency (x87)
  • cvtsi2sd : 2 op, 12 latency (SSE)

or

  • fld : 1 op, 4 latency (x87)
  • movsd [r,m] : 2op, 4 latency (SSE)

Delphi's compiler, while emitting SSE instructions, still seems to treat the workflow much as it would with the x87 unit, which isn't necessarily the best way. In either case, David is correct: the compiler is what it is. You can't do anything to change it.

Where I need fast mathematical routines, I still code them in ASM myself. This will generally be superior to anything a compiler can do because you can tailor the behaviour to the exact calculations you are doing. I have old legacy 32-bit applications with hand-tuned SSE3 ASM algorithms for complex number arithmetic and matrix operations. The key is that you don't need to optimize everything; you only need to optimize the bottlenecks. This is a rather important point to keep in mind.
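
Not part of this answer, but as a minimal sketch of how one might confirm that a routine really is a bottleneck before reaching for asm (SuspectRoutine is a hypothetical stand-in for your hot path):

program TimeIt;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Diagnostics;

// hypothetical stand-in for the routine you suspect is hot
procedure SuspectRoutine;
var
  i: Integer;
  s: Double;
begin
  s := 0.0;
  for i := 1 to 10000000 do
    s := s + Sqrt(i);
  Writeln(s:0:2);
end;

var
  sw: TStopwatch;
begin
  sw := TStopwatch.StartNew;  // measure before deciding to hand-optimize
  SuspectRoutine;
  sw.Stop;
  Writeln(Format('SuspectRoutine: %d ms', [sw.ElapsedMilliseconds]));
end.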

J...
  • 30,968
  • 6
  • 66
  • 143
8

I'm going to answer the meta question here: "why can't the Delphi compiler use more modern CPU instructions, and why can Java?"

Basically, there are two ways to compile code:

  1. pre-compiled on the developers machine
  2. post-compiled (including JITed) on the target machine

Examples of 1. include Delphi, C/C++, etc.
Examples of 2. include Java, .NET, JavaScript, etc.

Pre-compiled environments

Pre-compiled environments have you compile your code once and run it on your target computers. Compiled programs cannot run on machines with an older instruction set than the one the compiled program uses. So the instruction set you can rely on is limited both by what the compiler can emit and by the lowest architecture among all your target machines. If you don't know your target machines, you are limited by the compiler.

Post-compiled environments

Post-compiled environments compile on the target machine. You don't have to know what architecture your code will run on: the compiler that runs on that machine needs to know what it supports in order to take best advantage of it. Here the instruction set is limited only by what the compiler can emit and by the architecture of that single target machine.

The reason is that in a post-compiled, JITed or interpreted language environment, the compiler actually runs on the target machine, which means it can use all the features of that target architecture. It could even take into account aspects like physical memory, cache sizes or disk speed, and measure current run-time performance to do post-compile optimizations on running code.

Delphi and other tools

As for the Windows 32-bit Delphi compiler, I think the minimum requirement is still the 486, or the Pentium (given the Pentium-safe FDIV option). It uses x87 for floating point code because of that.
The Windows 64-bit Delphi compiler has a minimum requirement of SSE instructions, which it uses for floating point code.
I've not yet checked the other compiler platforms for their minimum requirements.

The minimum Delphi requirements have to do with the emphasis on backward compatibility.

Some other environments (most C/C++ compilers, and likely others too) allow you to specify the minimum instruction set. Delphi doesn't, and I think the main reason is the complexity of development and testing. The matrix (if it is indeed a 2-dimensional problem) of possibilities gets large really quickly.

JIT compilers usually do not fully support the latest hardware architecture benefits across the board, as it is very expensive to do so.

JIT compilers often do support optimizations for certain processor families (for instance when copying regions of memory).

I know that Java and .NET made quite some advances in this area over the last decade. There is a really good 2005 article on .NET JIT usage of CPU features.

Jeroen Wiert Pluimers
  • 23,965
  • 9
  • 74
  • 154
  • 1
  • ... and JIT suffers from some limitations, e.g. for application-wide optimization. This is why the [NGen tool was introduced by MS](http://stackoverflow.com/questions/5525082)... but honestly it is mostly marketing stuff: NGen also has some drawbacks, e.g. reducing [metaprogramming abilities](http://www.amazon.com/s/?url=search-alias=aps&tag=iteb0b-20&link_code=wql&_encoding=UTF-8&field-keywords=978-1-61729-026-8). There is no perfect solution, no definitive word about performance. The main speed factor of any software about to be written is between the chair and the keyboard. – Arnaud Bouchez Sep 04 '13 at 15:14
  • +1; @ArnaudBouchez if there were a silver bullet, we all would be without jobs (: – Jeroen Wiert Pluimers Sep 04 '13 at 16:48
  • 1
    The future: JIT and global static analysis when processors get fast enough. And as a consequence lots more dynamic typing too. :-) – alcalde Sep 04 '13 at 18:23
  • 1
    @JeroenWiertPluimers - "I know that Java and .NET made quite some advances in this area over the last decade" - have they advanced, or has Delphi lagged behind (not kept up)? (This is also related to our topic: http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html ) – Gabriel Sep 09 '15 at 18:22
  • All this has nothing to do with JIT. After years of Java and .NET marketing saying JIT is the best, Microsoft switched back to native AOT compiling... See https://msdn.microsoft.com/en-us/vstudio/dotnetnative.aspx Of course, it is still full of marketing FUD: ".NET Native compiles C# to native machine code that performs like C++" is another fake argument... .NET is much more than a compiler - it is a framework: its object model and its GC definitely have an impact on performance, and C++ would/could always be faster. Don't trust marketing. Delphi apps could be dead fast, even for server stuff. – Arnaud Bouchez Oct 23 '15 at 07:37
4

Background.

The question is how to optimize Delphi code to make it comparable to Java, and to do so without resorting to asm.

Analysis.

In the given example, the algorithm uses floating point calculations. The performance and the weak points of the compiler have been investigated in other answers. Theoretically the x64 compiler could perform better, since the SSE2 opcodes and registers offer better optimization opportunities. So the compiler would be the bottleneck here.

It was also suggested that a better algorithm could improve the performance. Let's look at this a little bit more.

Improving algorithm.

In the algorithm's loop, the loop index i is used three times as a variable in the calculations. Since this forces an integer-to-float conversion each time (upon loading into an FPU or SSE2 register), it has a big impact on performance. Let's investigate whether we can help the compiler to optimize away those conversions.

procedure xxx (n: integer; m: integer);
var
  t,ii: double;
  i, j: integer;
  d, r: double;
begin
  t:= 0.0;
  for j:= 1 to n do
  begin
    t:= t / 1000.0;
    ii:= 1.0;
    for i:= 1 to m do
    begin
      t:= t + ii / 999999.0;
      d:= t * t + ii;
      ii:= ii + 1.0;
      r:= (t + d) / (200000.0 * ii);
      t:= t - r;
    end;
  end;
  writeln(t);
end;

Now we have a clean algorithm using only float values. For reference, here is the Java code:

public static void xxy(int n, int m)
{
    double t;
    int i, j;
    double d, r, ii;
    t = 0.0;
    for (j = 1; j <= n; j++)
    {
      t = t / 1000.0;
      ii = 1.0;
      for (i = 1; i <= m; i++)
      {
            t = t + ii / 999999.0;
            d = t * t + ii;
            ii = ii + 1.0;
            r = (t + d) / (200000.0 * ii);
            t = t - r;
      }
    }
    System.out.println(t);
}

Benchmark.

Using the XE2 compiler.

                     x32      x64     Java(x64) 
                     --------------------------          
Original algorithm   23417ms  22293ms 22045ms
Updated algorithm    22362ms  14059ms 15507ms

The disassembly for the x64 code looks like this:

Project19.dpr.11: begin
000000000046ABC0 55               push rbp
000000000046ABC1 4883EC20         sub rsp,$20
000000000046ABC5 488BEC           mov rbp,rsp
Project19.dpr.12: t:= 0.0;
000000000046ABC8 F20F1005B0000000 movsd xmm0,qword ptr [rel $000000b0]
000000000046ABD0 C7C001000000     mov eax,$00000001
000000000046ABD6 4189C8           mov r8d,ecx
000000000046ABD9 89C1             mov ecx,eax
000000000046ABDB 413BC8           cmp ecx,r8d
000000000046ABDE 7F7B             jnle xxx + $9B
000000000046ABE0 4183C001         add r8d,$01
Project19.dpr.15: t:= t / 1000.0;
000000000046ABE4 F20F5E059C000000 divsd xmm0,qword ptr [rel $0000009c]
Project19.dpr.16: ii := 1.0;
000000000046ABEC F20F100D9C000000 movsd xmm1,qword ptr [rel $0000009c]
000000000046ABF4 C7C001000000     mov eax,$00000001
000000000046ABFA 4189D1           mov r9d,edx
000000000046ABFD 413BC1           cmp eax,r9d
000000000046AC00 7F50             jnle xxx + $92
000000000046AC02 4183C101         add r9d,$01
Project19.dpr.19: t:= t + ii / 999999.0;
000000000046AC06 660F28D1         movapd xmm2,xmm1
000000000046AC0A F20F5E1586000000 divsd xmm2,qword ptr [rel $00000086]
000000000046AC12 F20F58C2         addsd xmm0,xmm2
Project19.dpr.20: d:= t * t + ii;
000000000046AC16 660F28D0         movapd xmm2,xmm0
000000000046AC1A F20F59D0         mulsd xmm2,xmm0
000000000046AC1E F20F58D1         addsd xmm2,xmm1
Project19.dpr.21: ii := ii + 1.0;
000000000046AC22 F20F580D66000000 addsd xmm1,qword ptr [rel $00000066]
Project19.dpr.22: r:= (t + d) / (200000.0 * ii);
000000000046AC2A 660F28D8         movapd xmm3,xmm0
000000000046AC2E F20F58DA         addsd xmm3,xmm2
000000000046AC32 660F28D1         movapd xmm2,xmm1
000000000046AC36 F20F591562000000 mulsd xmm2,qword ptr [rel $00000062]
000000000046AC3E F20F5EDA         divsd xmm3,xmm2
000000000046AC42 660F29DA         movapd xmm2,xmm3
Project19.dpr.23: t:= t - r;
000000000046AC46 F20F5CC2         subsd xmm0,xmm2
Project19.dpr.24: end;
000000000046AC4A 83C001           add eax,$01
000000000046AC4D 413BC1           cmp eax,r9d
000000000046AC50 75B4             jnz xxx + $46
000000000046AC52 90               nop
Project19.dpr.25: end;
000000000046AC53 83C101           add ecx,$01
000000000046AC56 413BC8           cmp ecx,r8d
000000000046AC59 7589             jnz xxx + $24
000000000046AC5B 90               nop
Project19.dpr.26: WriteLn(t);
000000000046AC5C 488B0DC5100100   mov rcx,[rel $000110c5]
000000000046AC63 660F29C1         movapd xmm1,xmm0
000000000046AC67 E874D7F9FF       call @Write0Ext
000000000046AC6C 4889C1           mov rcx,rax
000000000046AC6F E88CD7F9FF       call @WriteLn
000000000046AC74 E877AFF9FF       call @_IOTest
Project19.dpr.27: end;
000000000046AC79 488D6520         lea rsp,[rbp+$20]

The extra integer to float conversions are gone and the registers are better used.

Extra optimization

For the x32 compiler, replacing the divisions by 999999.0 and 200000.0 with multiplications by reciprocal constants (const cA: Double = 1.0/999999.0; cB: Double = 1.0/200000.0;) will also improve performance.
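
That suggestion is not spelled out as code in the answer, so here is a minimal sketch (the routine name and constant names are illustrative); note that multiplying by a reciprocal is not bit-identical to dividing, so the printed result may differ in the last digits:

const
  cInv999999: Double = 1.0 / 999999.0;  // reciprocal of 999999.0
  cInv200000: Double = 1.0 / 200000.0;  // reciprocal of 200000.0

procedure xxxRecip(n, m: Integer);
var
  t, ii, d, r: Double;
  i, j: Integer;
begin
  t := 0.0;
  for j := 1 to n do
  begin
    t := t / 1000.0;
    ii := 1.0;
    for i := 1 to m do
    begin
      t := t + ii * cInv999999;        // multiplication instead of division
      d := t * t + ii;
      ii := ii + 1.0;
      r := (t + d) * cInv200000 / ii;  // only one division left, by ii
      t := t - r;
    end;
  end;
  Writeln(t);
end;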

LU RD
  • 34,438
  • 5
  • 88
  • 296
  • Is there any chance you could add modern XE8 or 10 Seattle compiled results to the table for comparison? To see if anything has improved since XE2. Much appreciated :-) – Kromster Sep 18 '15 at 13:17
  • 2
    @KromStern, I tested with D10Seattle, and for x64 it made no difference, but the x32 compiler produced 10% slower code. I did not investigate why, but perhaps I will do that when there is time to spare. I suppose that the extra optimization I suggested for the x32 compiler will compensate for that. I found other means to optimize the algorithm for x32 that were even faster, but left them out for clarity. – LU RD Sep 18 '15 at 13:28
2

How can we use modern CPU instructions in Delphi (without resorting to ASM)?

If the compiler won't emit the CPU instructions that you wish to use, then there is no alternative to generating the desired instructions yourself, for example by using the inline assembler.

David Heffernan
  • 601,492
  • 42
  • 1,072
  • 1,490