How to optimize this Delphi function with SSE2?

Question

I need a hint on how to implement this Delphi function using SSE2 assembly (32-bit). Other optimizations are welcome too. Perhaps someone can tell me which instructions could be used, so that I have a starting point for further reading.

Actual:

const Precision = 10000;

// This function adds all Pixels into one. The pixels are weighted before adding. 
// A weight can range from 0 to "Precision". "Size" is typically 10 to 50.

function TFilter.Combine(Pixels: PByte; Weights: PCardinal; const Size: Cardinal): Cardinal;
var
  i, R, G, B, A: Cardinal;
begin
  B := Pixels^ * Weights^; Inc(Pixels);
  G := Pixels^ * Weights^; Inc(Pixels);
  R := Pixels^ * Weights^; Inc(Pixels);
  A := Pixels^ * Weights^; Inc(Pixels);
  Inc(Weights); // goto next weight
  for i := 1 to Size - 1 do
  begin
    Inc(B, Pixels^ * Weights^); Inc(Pixels);
    Inc(G, Pixels^ * Weights^); Inc(Pixels);
    Inc(R, Pixels^ * Weights^); Inc(Pixels);
    Inc(A, Pixels^ * Weights^); Inc(Pixels);
    Inc(Weights); // goto next weight
  end;
  B := B div Precision;
  G := G div Precision;
  R := R div Precision;
  A := A div Precision;

  Result := A shl 24 or R shl 16 or G shl 8 or B;
end;

Expected:

function TFilter.Combine(Pixels: PByte; Weights: PCardinal; const Size: Cardinal): Cardinal;
asm
  // Insert fast SSE2-Code here ;-)
end;

I'd look at GR32 and see if it has the routine you need. If not then it's got lots of optimized SSE2 that you could use as a learning resource. — David Heffernan, Apr 12 '12 at 15:00
How many pixels does this combine at once? I ask because if the number is small enough, you won't see any notable speedup because of all the overhead. Also, do the Weight values need to be 32 bits? Will 16 bits contain them? — Multimedia Mike, Apr 13 '12 at 01:06
Weight Values don't have to be 32 Bits as they range only to precision which is 10000 (fits in 16 Bits). — Steffen Binas, Apr 13 '12 at 09:00

score 11 · Accepted Answer · answered Apr 13 '12 at 05:36

A rather straightforward implementation. I've changed your function prototype to a regular function (instead of an object method).

This code runs about 3 times faster than the byte-by-byte function (1500 ms for 1000000 iterations on a 256-element array, roughly 0.7 GB/sec on my old Athlon XP 2.2 GHz).

function Combine(Pixels: PByte; Weights: PInteger; const Size: Cardinal): Integer;
//x86, register calling convention - three parameters in EAX, EDX, ECX
const
  Precision: Single = 10000.0; // must match Precision = 10000 from the question
asm
  pxor XMM6, XMM6 //zero const
  pxor XMM4, XMM4 // zero accum

@@cycle:
  movd XMM1, [eax] //load color data
  movss XMM3, [edx]  //load weight

  punpcklbw XMM1, XMM6 //bytes to words
  shufps XMM3, XMM3, 0 // 4 x weight
  punpcklwd XMM1, XMM6 //words to ints
  cvtdq2ps XMM2, XMM3  //ints to singles
  cvtdq2ps XMM0, XMM1  //ints to singles

  mulps XMM0, XMM2    //data * weight
  addps XMM4, XMM0    //accum  = accum + data * weight

  add eax, 4        // inc pointers
  add edx, 4
  loop @@cycle

  movss XMM5, Precision
  shufps XMM5, XMM5, 0 // 4 x precision constant

  divps XMM4, XMM5    //accum/precision

  cvtps2dq XMM2, XMM4  //rounding singles to ints
  packssdw XMM2, XMM2 //ints to SmallInts (signed saturation)
  packuswb XMM2, XMM2  //SmallInts to bytes (unsigned saturation)

  movd eax, XMM2  //result
end;
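
One behavioural difference is worth checking before swapping the routines: `cvtps2dq` rounds to nearest (assuming the default MXCSR rounding mode), while the scalar code in the question truncates with `div`. A channel can therefore come out one higher than before. The two helpers below are hypothetical names used only to illustrate this; they are plain Pascal, not part of the answer's code.

```pascal
const
  Precision = 10000;

// Truncating division, as in the question's scalar code.
function ScaleTrunc(Sum: Cardinal): Cardinal;
begin
  Result := Sum div Precision;
end;

// Round-to-nearest, modelling what cvtps2dq does under the
// default rounding mode (an assumption about the MXCSR setting).
function ScaleRound(Sum: Cardinal): Cardinal;
begin
  Result := Round(Sum / Precision);
end;
```

For example, two blue values 12 and 13 with weights 2500 and 7500 give a sum of 127500: `ScaleTrunc(127500)` yields 12, while `ScaleRound(127500)` yields 13. If exact agreement with the scalar version matters, `cvttps2dq` (the truncating conversion) could be used instead of `cvtps2dq`.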

MBo


Wow, I didn't expect a fully working version! ;-) And yes, it works like a charm. The speedup on my machine (Core i7-920): about 4 times faster! – Steffen Binas Apr 13 '12 at 08:56
Some questions: Why are you using floating-point calculations? Isn't there a way to do it with integers only? I'd think it would be even faster. And if not, I could store the weights as singles so that no conversions would be needed. – Steffen Binas Apr 13 '12 at 08:59
About floating point: I don't see an SSE2 instruction for multiplying 4 packed integers. Maybe I've missed it. And yes, it is possible to store the weights as singles (though I doubt it would give a significant speedup). – MBo Apr 13 '12 at 10:35
@Stebi: You should consider accepting the answer if it helps you. – menjaraz Apr 14 '12 at 04:25
@Stebi: SSE, as a SIMD instruction set extension, targets floating-point data. You should stick to **MMX** instead of SSE if you want to do it with integers. – menjaraz Apr 14 '12 at 04:37
@menjaraz There are some integer instructions in SSEx, but their set is limited (for example, no integer division). – MBo Apr 14 '12 at 11:56
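
On the integer question raised in these comments: since the weights fit in 16 bits, the SSE2 instruction `pmaddwd` (multiply packed signed words, then add adjacent pairs into 32-bit sums) can compute the weighted sums without floats. The following is an untested sketch, not MBo's code. The name `CombineInt`, the two-pixels-per-iteration layout, and the restriction to an even `Size` with weights up to 32767 are assumptions made for illustration. It still converts to singles once at the end, because SSE2 has no packed integer division.

```pascal
function CombineInt(Pixels: PByte; Weights: PCardinal; const Size: Cardinal): Cardinal;
// x86-32, register calling convention: EAX = Pixels, EDX = Weights, ECX = Size
// Assumptions: Size is even (an odd last pixel is ignored), weights <= 32767.
const
  Precision: Single = 10000.0;
asm
  pxor xmm6, xmm6                // zero constant
  pxor xmm4, xmm4                // four 32-bit integer sums: B, G, R, A
  shr ecx, 1                     // number of pixel pairs
  jz @@done                      // nothing to do for Size < 2
@@cycle:
  movq xmm1, qword ptr [eax]     // two pixels: B0 G0 R0 A0 B1 G1 R1 A1
  movq xmm3, qword ptr [edx]     // two 32-bit weights: w0, w1
  pshufd xmm2, xmm1, 1           // copy with pixel 1 in the low dword
  punpcklbw xmm1, xmm2           // low bytes: B0 B1 G0 G1 R0 R1 A0 A1
  punpcklbw xmm1, xmm6           // bytes to words
  pshuflw xmm3, xmm3, $88        // low words: w0 w1 w0 w1
  pshufd xmm3, xmm3, 0           // all words: w0 w1 w0 w1 w0 w1 w0 w1
  pmaddwd xmm1, xmm3             // per channel: c0*w0 + c1*w1 (32-bit)
  paddd xmm4, xmm1               // accumulate
  add eax, 8                     // advance two pixels
  add edx, 8                     // advance two weights
  dec ecx
  jnz @@cycle
@@done:
  cvtdq2ps xmm4, xmm4            // integer sums to singles
  movss xmm5, Precision
  shufps xmm5, xmm5, 0           // 4 x precision constant
  divps xmm4, xmm5               // sums / precision
  cvttps2dq xmm2, xmm4           // truncate to ints, like div
  packssdw xmm2, xmm2            // ints to SmallInts
  packuswb xmm2, xmm2            // SmallInts to bytes
  movd eax, xmm2                 // result: A R G B
end;
```

An odd trailing pixel would need separate handling, for example by adding it with the scalar code from the question. Because a Single has only a 24-bit mantissa, the final conversion may differ from `div` by one for large sums.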
