1

I'm trying to optimize a cube function using SSE:

long cube(long n)
{
    return n*n*n;
}

I have tried this:

return (long) _mm_mul_su32(_mm_mul_su32((__m64)n,(__m64)n),(__m64)n);

The performance was even worse (and yes, I have never done anything with SSE before).

Is there an SSE function that could improve the performance? Or something else?

Output from cat /proc/cpuinfo:


processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Xeon(R) CPU            3070  @ 2.66GHz
stepping    : 6
cpu MHz     : 2660.074
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips    : 5320.14
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Xeon(R) CPU            3070  @ 2.66GHz
stepping    : 6
cpu MHz     : 2660.074
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips    : 5320.35
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

sherif
  • What's your platform, and did you check that your compiler isn't already outputting something optimal (or close)? – Mat Nov 27 '11 at 12:30
  • I haven't checked it, but as far as I have heard, the gcc compiler doesn't do that automatically. – sherif Nov 27 '11 at 12:45
  • SSE is for processing multiple elements (calculating multiple cubes) at once, not for boosting calculations of a single operation. – Marat Dukhan Nov 28 '11 at 02:15

3 Answers

8

I think you have misunderstood when it is useful to use SSE. But I have only used SSE with floating-point types so my experience may not be applicable to this case. I hope you can still learn some bits from what I have written.

SSE provides SIMD (Single Instruction, Multiple Data). It is useful when you have many values on which you want to perform the same calculation. It is a kind of small-scale parallelization: instead of doing one multiplication, you can do four at the same time. But it only helps when the calculations are independent of each other, so that all inputs are available at once.

In your case, there is no room for parallelization, because each multiplication needs the result of the previous one. You could, however, write a function that calculates the cubes of four floats at once, which would be faster than calling a function that calculates the cube of one number four times.
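As a rough illustration of the four-floats-at-once idea, here is a minimal sketch. The function name `cube4f` is made up for this example and is not from the question; it assumes a compiler that provides `<xmmintrin.h>` and falls back to a scalar loop otherwise.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#endif

/* Hypothetical helper (not from the original post): cubes four floats.
   in and out must each point to at least four floats. Unaligned
   loads/stores are used, so no 16-byte alignment is required. */
static void cube4f(const float *in, float *out)
{
    assert(in != NULL && out != NULL);
#if defined(__SSE__)
    __m128 x  = _mm_loadu_ps(in);   /* load four floats */
    __m128 x2 = _mm_mul_ps(x, x);   /* four squares with one multiply */
    __m128 x3 = _mm_mul_ps(x2, x);  /* four cubes with one more multiply */
    _mm_storeu_ps(out, x3);         /* store four results */
#else
    /* Scalar fallback for targets without SSE. */
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] * in[i] * in[i];
#endif
}
```

Calling `cube4f` on `{1, 2, 3, 4}` fills the output with `{1, 8, 27, 64}`. Note that the two multiplies still depend on each other; the gain comes only from handling four independent inputs per instruction.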

Mats
  • I do have multiple data: it's n and n and n, so three values I want to multiply. – sherif Nov 27 '11 at 13:08
  • @sherif No. You have `n`, `n*n`, and `(n*n)*n`. You cannot calculate the last value without the second, i.e. no parallelization is possible. If you had `s` and `t` and wanted to calculate `s*s*s` and `t*t*t`, you could do some of the calculations at the same time. – Mats Nov 28 '11 at 18:52
  • How could I do that? I'm calling the cube function from a for statement, so I could probably produce something like s*s*s and t*t*t. – sherif Nov 29 '11 at 11:17
  • I asked a new question concerning (x*x*x)+(y*y*y) http://stackoverflow.com/questions/8357182/multiplication-using-sse-xxxyyy – sherif Dec 02 '11 at 13:40
6

Your code compiles to:

cube:
        movl    4(%esp), %edx
        movl    %edx, %eax
        imull   %edx, %eax
        imull   %edx, %eax
        ret

If the function is inlined, the ret and the moves will be optimized out, so you are left with two imul instructions. I doubt MMX or SSE could make this any faster (transferring the data into the MMX/SSE registers alone would probably be slower than the two imuls).

Nils Pipenbrinck
0

You have to align your variables on 16 bytes, for one. Also, in my own experience tinkering with SSE, you will get significant gains if you compute your function on a whole batch of values, say:

void cube(long* inArray, long* outArray, size_t size) {
  ...
}
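
The body above is elided in the answer. As a minimal sketch of what a batch interface could look like, here is a plain scalar loop; it is not hand-written SSE. The name `cube_batch` is hypothetical, and the inputs are assumed small enough that `n*n*n` does not overflow `long`. Putting the loop inside the function gives the compiler a chance to vectorize it where the target supports that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch variant (the original answer elides the body).
   Assumes every n*n*n fits in a long; signed overflow is undefined. */
static void cube_batch(const long *in, long *out, size_t size)
{
    assert(size == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < size; ++i) {
        long n = in[i];
        out[i] = n * n * n;  /* two multiplies per element */
    }
}
```

Whether this beats calling an inlined scalar `cube` in the caller's own loop depends on the compiler and target, so it is worth measuring rather than assuming.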
Monkey
  • The cube function is called from a for statement that iterates over an array, so the compiler should inline it there automatically. Besides, I'm limited by memory usage. Thanks anyway. – sherif Nov 27 '11 at 13:17