
I'm working on an application that very often needs to convert 6 to 8 signed 32-bit integers to 32-bit floating-point numbers. I replaced the Delphi code with custom assembler code and, to my great surprise, the FPU conversion is always as fast as, and on some computers a good amount faster than, the SSE conversion. Here's some code that illustrates:

program Project1;

{$R *.res}

uses
 windows,dialogs,sysutils;

type
 piiii=^tiiii;
 tiiii=record i1,i2,i3,i4:longint; end;
 pssss=^tssss;
 tssss=record s1,s2,s3,s4:single; end;

var
 convert_value:single=13579.02468;

function convert_x87(adata:longint):single;
asm
 mov [esp-4],eax
 fild longint([esp-4])
 fmul [convert_value]
end;

procedure convert_sse(afrom,ato,aconv:pointer);
asm
 CVTDQ2PS xmm0,[eax]
 mulps xmm0,[ecx]
 movaps [edx],xmm0
end;

procedure get_mem(var p1,p2:pointer);
begin
 getmem(p1,31);
 p2:=pointer((longint(p1)+15) and (not 15));
end;

var
 a,b,c,d:cardinal;
 z:single;
 i:piiii;
 s1,s2:pssss;
 w1,w2,w3:pointer;
begin
 b:=gettickcount;
 a:=0;
 repeat
  z:=convert_x87(a);

  inc(a);
 until a=0;
 c:=gettickcount-b;

 get_mem(pointer(w1),pointer(i));
 get_mem(pointer(w2),pointer(s1));
 get_mem(pointer(w3),pointer(s2));

 s1.s1:=convert_value;
 s1.s2:=convert_value;
 s1.s3:=convert_value;
 s1.s4:=convert_value;

 b:=gettickcount;
 i.i1:=0;
 i.i2:=1;
 i.i3:=2;
 i.i4:=3;
 repeat
  convert_sse(i,s2,s1);

  inc(i.i1,4);
  inc(i.i2,4);
  inc(i.i3,4);
  inc(i.i4,4);
 until i.i1=0;
 d:=gettickcount-b;

 freemem(w1);
 freemem(w2);
 freemem(w3);

 showmessage('FPU:'+inttostr(c)+'/SSE:'+inttostr(d));
end.

There needs to be a rescaling (hence the multiply) during conversion, which is why there's one in there. The value used is just a random one I picked, but the result was the same no matter what value I used. There is also a very tiny difference in rounding between the FPU and SSE, but it doesn't matter in this case.

But if you run that code you'll see that the FPU path is never slower than the SSE path, which doesn't make sense. Does anyone have an idea what's going on?
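For readers without a Delphi compiler handy, the four-wide convert-and-scale in `convert_sse` can be sketched in C with SSE2 intrinsics (`_mm_cvtepi32_ps` compiles to the same CVTDQ2PS instruction the question uses; the function name here is my own, and unaligned loads/stores are used so the sketch works without the `get_mem` alignment trick):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Convert four int32s to four singles and rescale, mirroring the
   CVTDQ2PS + MULPS pair in convert_sse. */
static void convert4(const int *src, float *dst, float scale)
{
    __m128i vi = _mm_loadu_si128((const __m128i *)src);
    __m128  vf = _mm_cvtepi32_ps(vi);            /* CVTDQ2PS */
    vf = _mm_mul_ps(vf, _mm_set1_ps(scale));     /* MULPS */
    _mm_storeu_ps(dst, vf);
}
```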


EDIT: Here's different source code with the loop in assembler. The results are really interesting. If the increment instructions are commented out, the SSE version is faster than the FPU version by a noticeable amount, but if the increment instructions are included then they are roughly the same speed:

program Project1;

{$R *.res}

uses
 windows,dialogs,sysutils;

type
 piiii=^tiiii;
 tiiii=record i1,i2,i3,i4:longint; end;
 pssss=^tssss;
 tssss=record s1,s2,s3,s4:single; end;

var
 convert_value:single=13579.02468;

procedure test_convert_x87;
asm
 // init test data
 push ebx
 xor ebx,ebx

 mov [esp-4],$98765432

 // convert and multiply 1 int32 to 1 single
@next_loop:
// inc [esp-4]
 fild longint([esp-4])
 fmul [convert_value]
 fstp single([esp-8])

 // loop
 dec ebx
 jnz @next_loop

 pop ebx
end;

procedure test_convert_sse(afrom,ato,aconv:pointer);
asm
 // init test data
 push ebx
 xor ebx,ebx

 mov [eax+0],$98765432
 mov [eax+4],$98765432
 mov [eax+8],$98765432
 mov [eax+12],$98765432

 // convert and multiply 4 int32 to 4 single
@next_loop:
// inc [eax+0]
// inc [eax+4]
// inc [eax+8]
// inc [eax+12]
 cvtdq2ps xmm0,[eax]
 mulps xmm0,[ecx]
 movaps [edx],xmm0

 // loop
 sub ebx,4
 jnz @next_loop

 pop ebx
end;

procedure get_mem(var p1,p2:pointer);
begin
 getmem(p1,31);
 p2:=pointer((longint(p1)+15) and (not 15));
end;

var
 b,c,d:cardinal;
 i:piiii;
 s1,s2:pssss;
 w1,w2,w3:pointer;
begin
 b:=gettickcount;
 test_convert_x87;
 c:=gettickcount-b;

 get_mem(pointer(w1),pointer(i));
 get_mem(pointer(w2),pointer(s1));
 get_mem(pointer(w3),pointer(s2));

 s1.s1:=convert_value;
 s1.s2:=convert_value;
 s1.s3:=convert_value;
 s1.s4:=convert_value;

 b:=gettickcount;
 test_convert_sse(i,s2,s1);
 d:=gettickcount-b;

 freemem(w1);
 freemem(w2);
 freemem(w3);

 showmessage('FPU:'+inttostr(c)+'/SSE:'+inttostr(d));
end.
  • Please post the assembly code generated by the fast and slow version. That makes it much easier to find the culprit, because few people here use Pascal and could easily recreate your scenario. – Nils Pipenbrinck Dec 23 '14 at 19:45
  • Hi and thanks for the interest. If you check the posted source for the functions called convert_x87 and convert_sse you'll see they are in assembler; you should be able to copy-paste them around. Only the timing part is in Delphi. – Marladu Dec 23 '14 at 19:47
  • The function called convert_sse does 4 convert+multiplies at a time and is slower than, or the same speed as, the convert_x87 function that does 1 at a time called 4 times. The instructions used in the convert_sse function are CVTDQ2PS, mulps, and movaps; are they not the proper SIMD instructions for this task? – Marladu Dec 23 '14 at 19:57
  • @Marladu the surrounding code (loop) is very important. Simple three line assembler codes can completely be ruined by whatever the compiler does. I learned that the hard way. – Nils Pipenbrinck Dec 23 '14 at 19:58
  • Oh alright, I didn't understand what you meant, apologies. I'll edit the question with the loop assembler instructions in ~30 minutes. – Marladu Dec 23 '14 at 20:00
  • What is your hardware? – David Heffernan Dec 23 '14 at 20:18
  • I edited the question taking into account the comments. Turns out that with a static data set SSE is much faster than the FPU, but if the data is changed on every loop iteration they are the same speed. – Marladu Dec 23 '14 at 21:00
  • How big is your data set? If you constantly read from memory, it's quite possible that both cases are memory-bound, so the method of execution doesn't matter. – Leeor Dec 23 '14 at 21:25
  • Hi, the data set is one single for the FPU test and 4 singles for the SSE test; I meant that the data is changed on every iteration of the loop, as you can see in the test_convert_x87/sse functions by uncommenting the increments. Unexpectedly, it appears that incrementing 4 memory locations using basic x86 instructions is significantly slower than using SSE to read those same 4 memory locations, convert them from integer to single, multiply them, and store them back. – Marladu Dec 23 '14 at 21:36
  • 1
    In first program, comment out calls to `convert_sse` and `convert_x87` and you will see that x87 variant is much faster. All that code is then doing is counting to `2**32`. Indeed, the sse variant spends 25% of its time counting. A much greater percentage than the x87 version. My take on this is that the FP part of your code is insignificant. Are you 100% sure that this is your bottleneck? Only if you do nothing else in your program than convert from integer to float could you expect to improve perf. What percentage of time is spent in real program doing the conversion? – David Heffernan Dec 23 '14 at 22:39
  • What memory alignment do you use? I don't have direct experience of working with SSE instructions, but I do remember reading somewhere that you need to use proper memory alignment for the data you feed to SSE instructions if you want to get the most performance out of them. – SilverWarior Dec 23 '14 at 23:54
  • For the memory alignment it's 16 bytes for SSE, which is why the memory blocks are obtained from heap allocations, since Delphi can't align global data on 16-byte boundaries without some tricks. – Marladu Dec 24 '14 at 00:17
  • David, you are right that this isn't the bottleneck in my client's application. I just made a very quick and dirty simulation of changing data values that are being transformed, and this is what I posted. The contract I'm doing is to improve an old application that has a requirement of no more than SSE2 instructions, which means some old computers, and every cycle saved counts on 10-year-old computers. – Marladu Dec 24 '14 at 00:22
  • If you want to take a few minutes to write something clever about what you've observed in this question as an answer, I would accept it, since I consider this resolved (SSE is much faster than the FPU in optimal situations, but creating optimal situations can be challenging). – Marladu Dec 24 '14 at 00:23
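The 16-byte alignment discussed in the comments is what `get_mem`'s over-allocate-and-round trick provides (MOVAPS faults on an unaligned operand). A C equivalent, with a function name of my own choosing, would be:

```c
#include <stdlib.h>
#include <stdint.h>

/* Over-allocate by 15 bytes and round up to the next 16-byte
   boundary, the same trick get_mem uses. The raw pointer is
   returned via *raw so the caller can still free() it. */
static void *align16(void **raw, size_t size)
{
    *raw = malloc(size + 15);
    if (*raw == NULL)
        return NULL;
    return (void *)(((uintptr_t)*raw + 15) & ~(uintptr_t)15);
}
```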

1 Answer


The main thing that looks slow about your asm is not keeping stuff in registers. 4 inc of 4 successive memory locations is insane, no wonder it was slow. Esp. if you're just going to read them back from memory again the next time. Get your loop-counter vector set up outside the loop, and then increment it by adding a vector of { 1, 1, 1, 1 } to it.
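The register-resident counter described above can be sketched in C intrinsics (a hedged sketch of mine, not the answerer's code; each lane advances by 4 per store here, matching the `inc(i.iN,4)` stride of the question's first program):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Run `iters` convert-and-scale steps with the {i, i+1, i+2, i+3}
   counter kept in a register; each step stores four scaled floats
   and advances every lane with a single PADDD instead of four
   read-modify-write memory increments. */
static void convert_loop(float *dst, float scale, unsigned iters)
{
    __m128i ctr    = _mm_set_epi32(3, 2, 1, 0);  /* lanes 0..3 */
    __m128i step   = _mm_set1_epi32(4);
    __m128  vscale = _mm_set1_ps(scale);
    for (unsigned n = 0; n < iters; n++) {
        __m128 vf = _mm_mul_ps(_mm_cvtepi32_ps(ctr), vscale);
        _mm_storeu_ps(dst, vf);           /* one store per iteration */
        ctr = _mm_add_epi32(ctr, step);   /* PADDD, stays in a register */
    }
}
```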

Your question also doesn't have any reminders about what the 32-bit Windows calling conventions are (which arg goes in which register), so I had to figure that out from looking at your function arg variable names vs. how you use them.

So your inner loop can be something like:

; *untested*
    movdqa xmm1, [ vector_of_ones ]   ; or pcmpeqd same,same -> all-ones, then psrld by 31 -> {1,1,1,1}
    xor ebx, ebx  ; loop counter
;  also broadcast the scale value to xmm4, maybe with shufps
    movdqa   xmm2, [eax]   ; values to be incremented and converted
loop:
    cvtdq2ps xmm0, xmm2
    mulps    xmm0, xmm4  ; scale
    movaps   [edx], xmm0
    paddd    xmm2, xmm1  ; increment counters
    sub      ebx, 4
    jne      loop  ; loop 2^32 times

    ; movdqa    [eax], xmm2   ; store the incremented loop counter?
    ;  Not sure if this was desired, or a side effect of using mem instead of regs.
    ; If you want this to work on an array, put this store in the loop
    ; and use an indexed addressing mode for eax and edx (or increment pointers)

If this is for a function that isn't going to loop, then setting up the scale vector for mulps is different. Ideally the scale arg should be passed in the low element of a vector register, and you broadcast it from there with shufps or something. If Delphi forces it to arrive in memory pointed to by a GP register, then movss first, I guess. If it's a compile-time constant, using a 16B vector constant as a memory operand to mulps is probably the way to go. Core2 and later only take a single cycle for 128b loads. (It does need to be aligned, though, for non-AVX vector stuff on old CPUs.)
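The broadcast mentioned above can be sketched with the SHUFPS intrinsic (a hedged illustration of mine; `_MM_SHUFFLE(0, 0, 0, 0)` selects the low lane for every output position):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Replicate the low element of v into all four lanes. This is
   SHUFPS xmm, xmm, 0: the scale value arrives in the low element
   and gets broadcast before feeding MULPS. */
static __m128 broadcast_low(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
```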

Anyway, I think the main thing that was slow with your benchmark was memory access, especially the writes. Only one store per cycle is possible. If Delphi can't pass float args in registers, that sucks.

Peter Cordes
  • Hi Peter, thanks for your time and effort. It's been a long while since I posted this question; I think I was benchmarking my attempts to learn and ended up making a very poorly designed benchmark. I will read this answer properly this weekend, as I still have very much to learn on the subject, so thanks again for taking the time to answer. – Marladu Jul 03 '15 at 04:34
  • If you're coming back to this, try to write a whole inner loop in asm. In real use, as well as benchmarking, a non-inlined function call, or data movement for calling conventions, can dominate the time taken by the actual function body. – Peter Cordes Jul 03 '15 at 04:38