What's an example of a simple C function which is faster implemented in inline assembly?

Question

I'm having a hard time beating my compiler using inline assembly.

What's a good, non-contrived examples of a function which the compiler has a hard time making really, really fast and simple? But that's relatively simple to make with inline assembly.

Not to pick on you, but there are an awful lot of people on SO asking optimization and speed questions, and very few saying they need it because they are not meeting requirements. Apparently we haven't beat in the "premature optimization is the root of all evil" mantra enough :) — Chris Arguin, Jul 16 '09 at 17:30
What prompted my questions was that I was dicking around with inline assembly on the iPhone and was going to write up a blog post about it. But I couldn't for the life of me outdo my compiler. So I got curious to see whether there are known edge cases where compilers produce inefficient code. — Hans Sjunnesson, Jul 16 '09 at 18:42
ARM assembly is one of the "cleaner" instruction sets. Part of the philosophy of RISC processors is to not add instructions that aren't easily used by the compiler. You would have to look at the instruction set of particular ARM variant and find opcodes that don't have a clear C translation. — NoMoreZealots, Jul 16 '09 at 18:54
"Premature optimization is the root of all evil" doesn't apply to dicking around, or learning. It's worth writing assembly if for no other reason than to challenge yourself. You rarely come accross the cases where you do need it, but aids in your understand of the compiler generated code. — NoMoreZealots, Jul 16 '09 at 19:33
""Premature optimization is the root of all evil" doesn't apply to dicking around, or learning. It's worth writing assembly if for no other reason than to challenge yourself." AMEN TO THAT! — Olof Forshell, Dec 16 '12 at 09:03

Falaina · Answer 1 · 2009-07-16T19:02:50.340

If you don't consider SIMD operations cheating, you can usually write SIMD assembly that performs much better than your compilers autovectorization abilities (If it even has autovectorization!)

Here's a very basic SSE(One of x86's SIMD instruction sets) tutorial. It's for Visual C++ in-line assembly.

Edit: Here's a small pair of functions if you want to try for yourself. It's the calculation of an n length dot product. One is using SSE 2 instructions in-line (GCC in-line syntax) the other is very basic C.

It's very very simple and I'd be very surprised if a good compiler couldn't vectorize the simple C loop, but if it doesn't you should see a speed up in the SSE2. The SSE 2 version could probably be faster if I used more registers but I don't want to stretch my very weak SSE skills :).

 float dot_asm(float *a, float*b, int n)
{
  float ans = 0;
  int i; 
  // I'm not doing checking for size % 8 != 0 arrays.
  while( n > 0) {
    float tmp[4] __attribute__ ((aligned(16)));

     __asm__ __volatile__(
            "xorps      %%xmm0, %%xmm0\n\t"
            "movups     (%0), %%xmm1\n\t"
            "movups     16(%0), %%xmm2\n\t"
            "movups     (%1), %%xmm3\n\t"
            "movups     16(%1), %%xmm4\n\t"
            "add        $32,%0\n\t"
            "add        $32,%1\n\t"
            "mulps      %%xmm3, %%xmm1\n\t"
            "mulps      %%xmm4, %%xmm2\n\t"
            "addps      %%xmm2, %%xmm1\n\t"
            "addps      %%xmm1, %%xmm0"
            :"+r" (a), "+r" (b)
            :
            :"xmm0", "xmm1", "xmm2", "xmm3", "xmm4");

    __asm__ __volatile__(
        "movaps     %%xmm0, %0"
        : "=m" (tmp)
        : 
        :"xmm0", "memory" );             

   for(i = 0; i < 4; i++) {
      ans += tmp[i];
   }
   n -= 8;
  }
  return ans;
}

float dot_c(float *a, float *b, int n) {

  float ans = 0;
  int i;
  for(i = 0;i < n; i++) {
    ans += a[i]*b[i];
  }
  return ans;
}

SIMD is definitely not cheating. It provides a clear case of where compilers haven't kept up with hardware. C doesn't handle instruction level parallelizism well. Maybe it can unrolling loops here and there, but more advance routines need serious tweaking. — NoMoreZealots, Jul 16 '09 at 19:55
There are plenty of compilers that will output SIMD instructions. — jrockway, Jul 16 '09 at 20:58
They will, for limited cases. Basically as long as your code is written with a common technique or algorithm. Once the instruction set grows too big, optimal use of many instructions start to get lost in wash when writting a compiler or optimizer simply due to complexity. This was large part of the basis for the "RISC" processor concept. Optimization is simalar to chess, a computer can beat the majority people, but it takes a much more than a desktop to beat a grand master. — NoMoreZealots, Jul 16 '09 at 21:35

score 7 · Accepted Answer · answered Jul 28 '09 at 21:24

Since it's related to the iPhone and assembly code then I'll give an example that would be relevant in iPhone world (and not some sse or x86 asm). If anybody decides to write assembly code for some real world app, then most likely this is going to be some sort of digital signal processing or image manipulation. Examples: converting colorspace of RGB pixels, encoding images to jpeg/png format, or encoding sound to mp3, amr or g729 for voip applications. In case of sound encoding there are many routines that cannot be translated by the compiler to efficient asm code, they simply have no equivalent in C. Examples of the commonly used stuff in sound processing: saturated math, multiply-accumulate routines, matrix multiplication.

Example of saturated add: 32-bit signed int has range: 0x8000 0000 <= int32 <= 0x7fff ffff. If you add two ints result could overflow, but this could be unacceptable in certain cases in digital signal processing. Basically, if result overflows or underflows saturated add should return 0x8000 0000 or 0x7fff ffff. That would be a full c function to check that. an optimized version of saturated add could be:

int saturated_add(int a, int b)
{
    int result = a + b;

    if (((a ^ b) & 0x80000000) == 0)
    {
        if ((result ^ a) & 0x80000000)
        {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }
    return result;
}

you may also do multiple if/else to check for overflow or on x86 you may check overflow flag (which also requires you to use asm). iPhone uses armv6 or v7 cpu which have dsp asm. So, the saturated_add function with multiple brunches (if/else statements) and 2 32-bit constants could be one simple asm instruction that uses only one cpu cycle. So, simply making saturated_add to use asm instruction could make entire algorithm two-three times faster (and smaller in size). Here's the QADD manual: QADD

other examples of code that often executed in long loops are

res1 = a + b1*c1;
res2 = a + b2*c2;
res3 = a + b3*c3;

seems like nothing can't be optimized here, but on ARM cpu you can use specific dsp instructions that take less cycles than to do simple multiplication! That's right, a+b * c with specific instructions could execute faster than simple a*b. For this kind of cases compilers simply cannot understand logic of your code and can't use these dsp instructions directly and that's why you need to manually write asm to optimize code, BUT you should only manually write some parts of code that do need to be optimized. If you start writing simple loops manually then almost certainly you won't beat the compiler! There are multiple good papers on the web for inline assembly to code fir filters, amr encoding/decoding etc.

score 6 · Answer 3 · answered Jul 16 '09 at 17:40

6

Unless you are an assembly guru the odds of beating the compiler are very low.

A fragment from the above link,

For example, the bit-oriented "XOR %EAX, %EAX" instruction was the fastest way to set a register to zero in the early generations of the x86, but most code is generated by compilers and compilers rarely generated XOR instruction. So the IA designers, decided to move the frequently occurring compiler generated instructions up to the front of the combinational decode logic making the literal "MOVL $0, %EAX" instruction execute faster than the XOR instruction.

answered Jul 16 '09 at 17:40

Nick Dandoulakis

42,588
16
104
136

4

I'm not a an assembly guru, and I've beat the compiler. I very rarely resort to assembly. It was a last resort when I had to. This just seems like nay-saying. And it ignores his question. He admits it isn't easy in the question. – NoMoreZealots Jul 16 '09 at 23:27
1

I didn't say it's impossible. If you grok the instruction set you can try to write faster code or to squeeze the routine to fewer instructions. If you have a not very sophisticated compiler or the compiler doesn't handle the sse, 3dnow sets, writing assembly might be the *proper* way to implement some routines. – Nick Dandoulakis Jul 17 '09 at 03:07
1

You are right, understanding the instruction set is a absolute necessity if you want to have any hope of beating a complier. But even with a good compiler, you can find instructions that don't have C constructs that map well to them on modern architectures. There are still "gaps" in the abstractions that are just growing larger as the multicore paradigm becomes the norm. And in today's power conscious and mobile driven market, we can't assume a faster cpu core speeds in our applications. CPUs hit 1GHz in 1999, and new apps are running on the "hottest" hard are clocking at 400Mhz today. – NoMoreZealots Jul 17 '09 at 12:36
1

By "Hottest" hardware, I mean things like the IPhone, and what not. Battery life makes the tradeoffs between efficiency and development time tilt in a completely new direction. – NoMoreZealots Jul 17 '09 at 12:42
1

Pete, I don't argue. And these *gaps* are another example of the *leaky abstraction* notion :), http://en.wikipedia.org/wiki/Leaky_abstraction – Nick Dandoulakis Jul 17 '09 at 12:42
1

When it comes to how languages map to hardware the issue is more fundomental than general software development. As much as I like C and C++, very few languages capture the notion parallelizism well. I would really like to see more Computer Science influence and cross-over in the hardware description languages such as VHDL. They do parallelizism well, but suck at abstractions. Cross pollenization would help both sides of the fence. – NoMoreZealots Jul 17 '09 at 13:36

score 5 · Answer 4 · edited Jul 16 '09 at 20:55

I implemented a simple cross correlation using a generic "strait C" implementation. And THEN when it took longer than the timeslice I had available, I resorted to explicit parallelization of the algorithm and using processor intrinsic to force the specific instructions to be used in the calculations. For this particular case, the computation time was reduce from >30ms to just over 4ms. I had a 15ms window to complete processing before the next data acquisition occurred.

This was a SIMD type optimization on a VLWI processor. This only require 4 or so of the processor intrinsics, which are basically assembly language instructions that give the appearance of a function call in the source code. You could do the same with inline assembly but the syntax and register management is a little nicer with processor intrinsics.

Other than that if size matters, assembler is king. I went to school with a guy who wrote a full screen text editor in less than 512 bytes.

This is a classic case where assembler is sensible. The code was written in C; worked, but not fast enough. Recoding in assembler made it work fast enough - that was a good reason to drop into assembler. — Jonathan Leffler, Jul 16 '09 at 20:57
I was dissappointed at the performance I got out of the strait C version, the chip vender's propaganda bragged about how good their C compiler was. And they're most recent toolchain doesn't do any better job optimizing it either. Unfortunately DSPs with VLWI aren't easy to write an optimizer for. — NoMoreZealots, Jul 16 '09 at 21:22

AShelly · Answer 5 · 2009-07-16T21:26:46.350

I have an checksum algorithm which requires words to be rotated by a certain number of bits. To implement it, I've got this macro:

//rotate word n right by b bits
#define ROR16(n,b) (((n)>>(b))|(((n)<<(16-(b)))&0xFFFF))

//... and inside the inner loop: 
sum ^= ROR16(val, pos);

VisualStudio release build expands to this: (val is in ax, pos is in dx, sum is in bx)

mov         ecx,10h 
sub         ecx,edx 
mov         ebp,eax 
shl         ebp,cl 
mov         cx,dx 
sar         ax,cl 
add         esi,2 
or          bp,ax 
xor         bx,bp

The more efficient equivalent hand-generated assembly would be:

 mov       cl,dx
 ror       ax,cl
 xor       bx,ax

I haven't figured out how to emit the ror instruction from pure 'c' code. However...
While writing this up, I remembered compiler intrinsics. I can generate the second set of instructions with:

sum ^= _rotr16(val,pos);

So my answer is: Even if you think you can beat the pure c compiler, check the intrinsics before resorting to inline assembly.

I tried this in gcc (4.0.1) with -O4. It does output a ROR instruction for a 32-bit rotate, but not for 16 bits. — finnw, Jul 17 '09 at 11:17

score 2 · Answer 6 · answered Jul 16 '09 at 17:28

2

If you want to do stuff like SIMD operations, you might be able to beat a compiler. This will require good knowledge of the architecture and the instruction set though.

answered Jul 16 '09 at 17:28

samoz

56,849
55
141
195

You really can't understate the importance of understanding the architecture and instruction set when dealing with assembly. I typically avoid asm, but I still make it point to learn the capabilies of the architecture so I can have some idea of the theoretical performance available. – NoMoreZealots Jul 16 '09 at 18:49

score 0 · Answer 7 · answered Jul 16 '09 at 17:27

My best win out over a compiler was on a simple memcpy routine... I skipped a lot of the basic setup stuff ( e.g., I didn't need much of a stack frame, so I save a few cycles there ), and did a few pretty hairy things.

That was about 6 years ago, with some proprietary compiler of unknown quality. I'll have to dig up the code I had and try it against GCC now; I don't know that it could get any faster, but I wouldn't rule it out.

In the end, even though my memcpy was on average about 15x faster than the one in our C library, I just kept it in my back pocket in case I needed it. It was a toy for me to play with PPC assembly, and the speed boost wasn't necessary in our application.

What's an example of a simple C function which is faster implemented in inline assembly?

7 Answers7