
The classical multiply-accumulate operation is a = a + b*c. But I wonder whether there is an instruction that performs the following operations on integers in 1 clock cycle (a and b are unsigned 64-bit integers, i.e. unsigned long long int):

a = a*2-1
a = a*2+b

Currently, I use:

a *= 2
--a

for the first one and

a *= 2
a += b

for the second one. I think each of these is translated into 2 assembly instructions. Is there a way to use a single instruction instead (and if so, which instruction set extension on Intel CPUs provides it)?

(I ask because I perform these operations billions of times.)
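Written out as plain C functions, a minimal sketch of the two operations (the function names are mine, purely illustrative):

```c
/* a = a*2 - 1 */
unsigned long long dbl_minus1(unsigned long long a) {
    return a * 2 - 1;
}

/* a = a*2 + b */
unsigned long long dbl_plus(unsigned long long a, unsigned long long b) {
    return a * 2 + b;
}
```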

phuclv
Vincent
    Why does it matter how many instructions the compiler generates? This is going to be only loosely related to the number of clock cycles the calculation takes? – CB Bailey Feb 11 '12 at 16:54
  • @KerrekSB, You're right - `lea` can do `a*2+b` if `b` is between 0 and 4096, or you have it in a register. – ugoren Feb 11 '12 at 17:11
  • @Vincent - Current CPUs can execute multiple simple instructions each clock cycle. Removing one doesn't guarantee that the next instruction can fill the gap. You really need a compiler to do the bookkeeping! – Bo Persson Feb 11 '12 at 19:18

2 Answers

  1. For Intel CPUs, see the LEA instruction. It can do each of your tasks in one instruction (not sure about cycle counts, though), e.g. LEA EAX, [EAX*2+EBX]. Note that it wasn't really meant as a multiply-add, hence its odd name (load effective address).

  2. In C and C++, you shouldn't bother. The compiler will do what it thinks is best, and you would probably just hinder its efforts. I'd stay with good old a = a*2-1.

PS: If you think something is translated into two instructions, nothing is easier than looking at the assembly. Then you would know.
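As a concrete sketch of that advice (the function name is mine, for illustration): compile the file with `gcc -O2 -S` or `clang -O2 -S` and read the generated `.s` file.

```c
/* Compile with e.g. `gcc -O2 -S lea_demo.c` and inspect the output.
   On x86-64 at -O2, GCC and Clang typically emit a single lea here,
   something like: lea rax, [rsi + rdi*2] (register names vary by ABI). */
unsigned long long mul2_add(unsigned long long a, unsigned long long b) {
    return a * 2 + b;
}
```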

jpalecek
  • Agree. Once upon a time LEA was free because the CPUs had dedicated address calculation units that idled when not used. Not true for the current generations, where it will probably generate the same micro-ops as separate shifts and adds. – Bo Persson Feb 11 '12 at 19:24
  • 1
    `lea eax, [eax*2 + ebx]` is 1 cycle latency on Intel CPUs (scaled index doesn't make it a complex LEA). But on AMD CPUs, a scaled index makes it a complex LEA so it does have 2 cycle latency. Still only 1 uop, though. https://agner.org/optimize/. @BoPersson: LEA is very common, and is worth using because it's *not* microcoded. It's a single uop shift-and-add. But yes it runs on ALU execution units, not the AGUs. Simple LEAs have 2 per clock throughput on Intel SnB-family, vs. 1 per clock for complex LEA. Next-gen Intel is going to have LEA units on all 4 ALU ports. – Peter Cordes Apr 05 '19 at 21:18

There are a lot of architectures that can do such operations in a single instruction. For example, a*2 + b compiles to

  • lea eax, [rsi+rdi*2] on x86-64
  • add r0, r1, r0, lsl #1 on ARM
  • add w0, w1, w0, lsl 1 on ARM64
  • lda16 r0, r1[r0] on xcore

The compiler will optimize the expression appropriately. There's no reason to write things like a *= 2; a += b, which in many cases just reduces readability.

You can see the demo on Compiler Explorer.


However, if you ask this just because you perform these operations billions of times, then it's essentially an XY problem: rewriting the C source this way isn't the right approach, and reducing the number of instructions isn't how you reduce runtime. You don't measure performance by instruction count.

Modern CPUs are superscalar, and some instructions are microcoded, so a single complex instruction may be slower than multiple simple instructions that can execute in parallel. Compilers obviously know this and take latency into account when compiling. The real solution is to use multithreading and SIMD.
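A loop of the shape compilers auto-vectorize might look like this (a sketch; the function and array names are illustrative, and the element type is 32-bit to match the vpaddd listing below):

```c
#include <stddef.h>

/* a[i] = a[i]*2 + b[i] over whole arrays. With -O3 (plus e.g.
   -march=skylake-avx512 on x86-64), compilers auto-vectorize and
   unroll loops of this shape into SIMD add sequences. */
void mul2_add_arrays(unsigned int *a, const unsigned int *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = a[i] * 2 + b[i];
}
```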

For example, Clang emits the following instructions in the main loop when targeting AVX-512:

vpaddd  zmm0, zmm0, zmm0                            ; a *= 2
vpaddd  zmm1, zmm1, zmm1
vpaddd  zmm2, zmm2, zmm2
vpaddd  zmm3, zmm3, zmm3
vpaddd  zmm0, zmm0, zmmword ptr [rsi + 4*rdx]       ; a += b
vpaddd  zmm1, zmm1, zmmword ptr [rsi + 4*rdx + 64]
vpaddd  zmm2, zmm2, zmmword ptr [rsi + 4*rdx + 128]
vpaddd  zmm3, zmm3, zmmword ptr [rsi + 4*rdx + 192]

which involves both loop unrolling and auto-vectorization. Each instruction works on sixteen 32-bit integers at a time. Of course, if you use 64-bit ints then it works on "only" 8 at a time. Moreover, each of these instructions executes independently of the others, so a CPU with enough execution ports can add 64 ints in parallel. Now that's what we call "fast".

GCC is often less aggressive at loop unrolling and uses a vpslld followed by a vpaddd, but that's still far faster than the scalar version. On ARM with NEON you can see that shl v0.4s, v0.4s, 1 followed by add v0.4s, v0.4s, v1.4s is used. Here's the Compiler Explorer demo link.

Combined with multithreading, that's hugely faster than your "optimization".

Peter Cordes
phuclv