There are a lot of architectures that can do such an operation in a single instruction. For example, `a*2 + b` compiles to

- `lea eax, [rsi+rdi*2]` on x86-64
- `add r0, r1, r0, lsl #1` on ARM
- `add w0, w1, w0, lsl 1` on ARM64
- `lda16 r0, r1[r0]` on XCore
The compiler will optimize the expression appropriately, so there's no reason to write things like `a *= 2; a += b` manually, which in many cases only reduces readability. You can see the demo on Compiler Explorer.
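For reference, a minimal function like the following (the name is illustrative) is the kind of code that produces the instructions above:

```c
/* A sketch of the function behind the codegen shown above;
   the name is illustrative. */
int mul_add(int a, int b) {
    return a * 2 + b;   /* folds into lea / add-with-shift / lda16 */
}
```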
However, if you're asking because you do this operation billions of times, then this is essentially an XY problem: changing the C version isn't the right fix, and reducing the number of instructions isn't how you reduce runtime. You don't measure performance by instruction count.
Modern CPUs are superscalar, and some instructions are microcoded, so a single complex instruction may be slower than multiple simple instructions that can execute in parallel. Compilers know this and take latency and throughput into account while compiling. The real solution is to use multithreading and SIMD.
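Consider a simple loop over two arrays (a sketch; the name and signature are illustrative):

```c
#include <stddef.h>

/* An illustrative loop that compilers will auto-vectorize. */
void scale_add(int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * 2 + b[i];
}
```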
For such a loop, Clang emits the following instructions in the main loop when AVX-512 is enabled:
```asm
vpaddd  zmm0, zmm0, zmm0                       ; a *= 2
vpaddd  zmm1, zmm1, zmm1
vpaddd  zmm2, zmm2, zmm2
vpaddd  zmm3, zmm3, zmm3
vpaddd  zmm0, zmm0, zmmword ptr [rsi + 4*rdx]  ; a += b
vpaddd  zmm1, zmm1, zmmword ptr [rsi + 4*rdx + 64]
vpaddd  zmm2, zmm2, zmmword ptr [rsi + 4*rdx + 128]
vpaddd  zmm3, zmm3, zmmword ptr [rsi + 4*rdx + 192]
```
which involves both loop unrolling and auto-vectorization. Each instruction works on sixteen 32-bit integers at a time. Of course, if you use a 64-bit `int` then each instruction can work on "only" eight at a time. Moreover, these unrolled instructions are independent of one another, so if the CPU has enough execution ports it can add 64 `int`s in parallel. Now that's what we call "fast".
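You can also write the SIMD form by hand. Here's a minimal sketch of one 16-lane step using AVX-512 intrinsics (the function name is illustrative, and AVX-512F support such as `-mavx512f` is assumed; in practice, letting the auto-vectorizer do it is usually enough):

```c
#include <immintrin.h>

/* One 16-lane step of a[i] = a[i]*2 + b[i], mirroring the
   auto-vectorized code above. Assumes AVX-512F is available. */
void scale_add_step(int *a, const int *b) {
    __m512i va = _mm512_loadu_si512(a);   /* load 16 ints from a */
    __m512i vb = _mm512_loadu_si512(b);   /* load 16 ints from b */
    va = _mm512_add_epi32(va, va);        /* a *= 2, done as a + a */
    va = _mm512_add_epi32(va, vb);        /* a += b */
    _mm512_storeu_si512(a, va);           /* store the result */
}
```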
GCC is often less aggressive at loop unrolling and uses a `vpslld` followed by a `vpaddd`, but that's still much faster than the scalar version. On ARM with NEON you can see that `shl v0.4s, v0.4s, 1; add v0.4s, v0.4s, v1.4s` is used. Here's the Compiler Explorer demo link.
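For completeness, the NEON equivalent of one 4-lane step can be written with intrinsics like this (again a sketch with illustrative names):

```c
#include <arm_neon.h>

/* One 4-lane step of a[i] = a[i]*2 + b[i] with NEON intrinsics,
   matching the shl + add sequence above. */
void scale_add_step_neon(int32_t *a, const int32_t *b) {
    int32x4_t va = vld1q_s32(a);   /* load 4 ints from a */
    int32x4_t vb = vld1q_s32(b);   /* load 4 ints from b */
    va = vshlq_n_s32(va, 1);       /* a *= 2 via shift left by 1 */
    va = vaddq_s32(va, vb);        /* a += b */
    vst1q_s32(a, va);              /* store the result */
}
```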
Combined with multithreading, that's hugely faster than your "optimization".
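A minimal sketch of combining the two, assuming OpenMP is available (compile with `-fopenmp`) and that the arrays don't overlap:

```c
#include <stddef.h>

/* Parallel, vectorized a[i] = a[i]*2 + b[i]; a sketch assuming
   OpenMP support and non-overlapping arrays. */
void scale_add_parallel(int *restrict a, const int *restrict b, size_t n) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * 2 + b[i];
}
```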