Since the question is about the iPhone and assembly code, I'll give an example that is relevant in the iPhone world (and not some SSE or x86 asm).
If anybody decides to write assembly code for a real-world app, it will most likely be for some sort of digital signal processing or image manipulation. Examples: converting the colorspace of RGB pixels, encoding images to JPEG/PNG, or encoding sound to MP3, AMR or G.729 for VoIP applications.
In sound encoding there are many routines that the compiler cannot translate to efficient asm code, because they simply have no direct equivalent in C. Commonly used examples in sound processing: saturated math, multiply-accumulate routines, matrix multiplication.
Example of a saturated add: a 32-bit signed int has the range 0x80000000 <= int32 <= 0x7fffffff (that is, -2^31 to 2^31 - 1). If you add two ints the result can overflow, which is unacceptable in certain digital signal processing cases. Basically, a saturated add should return 0x7fffffff if the result overflows and 0x80000000 if it underflows. Checking for that in C takes a full function.
An optimized version of a saturated add could be:
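#include <stdint.h>

int32_t saturated_add(int32_t a, int32_t b)
{
    /* Add as unsigned: signed overflow is undefined behavior in C. */
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;
    /* Overflow is only possible when a and b have the same sign... */
    if (((ua ^ ub) & 0x80000000u) == 0)
    {
        /* ...and it happened if the sign of the sum differs from theirs. */
        if ((sum ^ ua) & 0x80000000u)
        {
            return (a < 0) ? INT32_MIN : INT32_MAX;
        }
    }
    return (int32_t)sum;
}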
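For comparison, here is a more literal sketch of the same check that widens the sum to 64 bits and clamps it with plain if statements. The name saturated_add_ref is made up for illustration; a version like this is mostly useful as a reference when testing a faster one.

```c
#include <stdint.h>

/* Reference saturating add: do the sum in 64 bits, where it cannot
   overflow, then clamp it back into the int32_t range.
   (saturated_add_ref is a hypothetical name, not from any library.) */
int32_t saturated_add_ref(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;

    if (wide > INT32_MAX)
        return INT32_MAX;   /* positive overflow -> 0x7fffffff */
    if (wide < INT32_MIN)
        return INT32_MIN;   /* negative overflow -> 0x80000000 */
    return (int32_t)wide;
}
```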
You could also use multiple if/else checks to detect overflow, or on x86 you could check the overflow flag (which also requires asm). The iPhone uses an ARMv6 or ARMv7 CPU, which has DSP instructions. So the saturated_add function, with its multiple branches (if/else statements) and two 32-bit constants, could become one simple asm instruction that typically takes a single CPU cycle.
So simply making saturated_add use that asm instruction could make the entire algorithm two to three times faster (and smaller in size). Here's the QADD manual:
QADD
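A minimal sketch of what using that instruction from C could look like, assuming GCC or Clang inline assembly syntax and a 32-bit ARM target with the DSP extensions (detected here through the ACLE macro __ARM_FEATURE_DSP). The wrapper name qadd32 is made up. On other targets the sketch falls back to __builtin_add_overflow, which is a GCC/Clang extension rather than standard C.

```c
#include <stdint.h>

/* qadd32 is a hypothetical wrapper name used only for this sketch. */
static inline int32_t qadd32(int32_t a, int32_t b)
{
#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
    /* 32-bit ARM with DSP extensions: one saturating-add instruction. */
    int32_t result;
    __asm__("qadd %0, %1, %2" : "=r"(result) : "r"(a), "r"(b));
    return result;
#else
    /* Elsewhere: let the compiler detect the overflow (GCC/Clang builtin). */
    int32_t result;
    if (__builtin_add_overflow(a, b, &result))
        return (a < 0) ? INT32_MIN : INT32_MAX;
    return result;
#endif
}
```

Keeping the asm behind a small inline function like this means the rest of the codec stays in C and still builds on the simulator or a desktop machine.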
Other examples of code that is often executed in long loops are:
res1 = a + b1*c1;
res2 = a + b2*c2;
res3 = a + b3*c3;
It seems like nothing can be optimized here, but on an ARM CPU you can use specific DSP instructions that take fewer cycles than a simple multiplication! That's right: with the right instructions, a + b * c can execute faster than a plain a * b. In cases like this the compiler often cannot understand the logic of your code well enough to use these DSP instructions, which is why you need to write asm manually to optimize the code. BUT you should only hand-write the parts of the code that really need to be optimized. If you start writing simple loops manually, you almost certainly won't beat the compiler!
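As a rough sketch of that idea, the helper below wraps SMLABB, one of the ARM DSP multiply-accumulate instructions, which multiplies the low 16 bits of two registers and adds a 32-bit accumulator. It assumes GCC/Clang inline assembly and 16-bit samples; mac16 and dot16 are hypothetical names, and on targets without the ARM DSP extensions the code falls back to plain C.

```c
#include <stddef.h>
#include <stdint.h>

/* acc + x * y for 16-bit x and y (mac16 is a made-up name). */
static inline int32_t mac16(int32_t acc, int16_t x, int16_t y)
{
#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
    int32_t result;
    /* SMLABB Rd, Rn, Rm, Ra: Rd = Ra + Rn[15:0] * Rm[15:0] */
    __asm__("smlabb %0, %1, %2, %3"
            : "=r"(result)
            : "r"((int32_t)x), "r"((int32_t)y), "r"(acc));
    return result;
#else
    /* Portable fallback; the unsigned add wraps like the instruction does. */
    int32_t product = (int32_t)x * (int32_t)y;
    return (int32_t)((uint32_t)acc + (uint32_t)product);
#endif
}

/* Dot product of two 16-bit sample buffers, a typical FIR inner loop. */
int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = mac16(acc, a[i], b[i]);
    return acc;
}
```

Whether this beats what the compiler generates for the plain C loop is worth measuring on the actual target before keeping the asm.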
There are multiple good papers on the web about using inline assembly to code FIR filters, AMR encoding/decoding, etc.