6

I am writing a Linux Kernel driver (for ARM) and in an irq handler I need to check the interrupt bits.

bit
 0/16  End point 0 In/Out interrupt
       (very likely, while In is more likely)
 1/17  End point 1 In/Out interrupt
 ...
15/31  End point 15 In/Out interrupt

Note that more than a bit can be set at a time.

So this is the code:

int i;
u32 intr = read_interrupt_register();

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_in(i);
    }
    if(unlikely(intr & (1 << (i + 16)))){
        handle_ep_out(i);
    }
}

(1 << 0) and (1 << 16) would be calculated in compile time, however (1 << i) and (1 << (i + 16)) wouldn't. Also there would be integral comparison and addition in the loop.

Because it is an irq handler, work should be done within the shortest time. This let me think whether I need to optimize it a bit.

Possible ways?

1. Split the loop, seems to make no difference...

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_in(i);
    }
}
for(i=17;i<32;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_out(i - 16);
    }
}

2. Shift intr instead of the value to be compared to?

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    intr >>= 1;
    if(unlikely(intr & 1)){
        handle_ep_in(i);
    }
}
intr >>= 1;
for(i=1;i<16;++i){
    intr >>= 1;
    if(unlikely(intr & 1)){
        handle_ep_out(i);
    }
}

3. Fully unroll the loop (not shown). That would make the code a bit messy.

4. Any other better ways?

5. Or it's that the compiler will actually generate the most optimized way?


Edit: I was looking for a way to tell the gcc compiler to unroll that particular loop, but it seems that it isn't possible according to my search...

Alvin Wong
  • 12,210
  • 5
  • 51
  • 77
  • You have only 17 elements to handle. Manually unrolled it is not messier than the code in your first example – fork0 Sep 13 '12 at 07:28

2 Answers2

5

If we can assume that the number of set bits in intr is low (as it is usually the case in interrupt masks) we can optimize a little bit and write a loop that executes for each bit only once:

void handle (int intr)
{
  while (intr)
  {
    // find index of lowest bit set in intr:
    int bit_id = __builtin_ffs(intr)-1;

    // call handler:
    if (bit_id > 16)
      handle_ep_out (bit_id-16);
    else
      handle_ep_in (bit_id);

    // clear that bit
    // (I think there was a bit-hack out there to simplify this step even further)
    intr -= (1<<bit_id);
  }
}

On most ARM architectures __builtin_ffs will compile down to a CLZ instruction and some arithmetic around it. It should do so for anything but ARM7 and older cores.

Also: When writing interrupt handlers on embedded devices the size of the function makes a difference for performance as well because the instructions have to be loaded into the code-cache. Lean code usually executes faster. A bit overhead is okay if you save memory accesses to memory that is unlikely to be in the cache.

Nils Pipenbrinck
  • 83,631
  • 31
  • 151
  • 221
  • You missed the special cases for the argument-less functions `handle_ep0_in` and `handle_ep0_out`, but +1 for ffs – fork0 Sep 13 '12 at 07:48
  • Also I don't know if __builtin_ffs is allowed in the kernel, but they very likely have some replacement for it if it is not allowed. – Nils Pipenbrinck Sep 13 '12 at 07:49
  • Why would it not be allowed. If it by some chance is not, you can use clz via inline asm directly. – dbrank0 Sep 13 '12 at 07:54
  • @nils-pipenbrinck: Yes, ffs, with arch-specific versions if available in `__ffs`. – fork0 Sep 13 '12 at 07:55
  • @dbrank0: often the reason for such restrictions is the compiler support library, which is not used in the kernel. Search for "udivdi3" here. – fork0 Sep 13 '12 at 07:57
  • I just checked the code compiled with -O3 for cortex-a8.. Looks neat. 10 instructions in the loop-body including the loop control and the branches to the handle_op_in/out functions. – Nils Pipenbrinck Sep 13 '12 at 08:00
  • I think my interrupt handler cannot be small indeed. There is only a single irq for that device, and there are more than 50 interrupts to handle inside it. The code shown above is only some of them. – Alvin Wong Sep 13 '12 at 08:10
  • Just found [`ffs`](http://lxr.free-electrons.com/source/arch/arm/include/asm/bitops.h?v=2.6.35;a=arm#L294) in LXR. But what is the difference of `ffs(x)` and `__ffs(x)`? `#define __ffs(x) (ffs(x) - 1)` – Alvin Wong Sep 13 '12 at 08:12
  • It comes out to be the same but for the logic of it I would go for `intr ^= (1< – Jens Gustedt Sep 13 '12 at 09:08
1

I would probably go for option 5 myself. Code for readability and let gcc's insane optimisation level -O3 do what it can.

I've seen code generated at that level that I can't even understand.

Any hand-crafted optimisation in C (other than possibly unrolling and using constants rather than runtime bit shifts, a la option 3) is unlikely to outperform what the compiler itself can do.

I think you'll find that the unrolling may not be as messy as you think:

if (  likely(intr & 0x00000001)) handle_ep0_in();
if (  likely(intr & 0x00010000)) handle_ep0_out();

if (unlikely(intr & 0x00000002)) handle_ep_in(1);
if (unlikely(intr & 0x00020000)) handle_ep_out(1);

:

if (unlikely(intr & 0x00008000)) handle_ep_in(15);
if (unlikely(intr & 0x80000000)) handle_ep_out(15);

In fact, you can make it a lot less messier with macros (untested, but you should get the general idea):

// Since mask is a constant, "mask << 32" should be too.

# define chkintr (mask, num) \
    if (unlikely(intr & (mask      ))) handle_ep_in  (num); \
    if (unlikely(intr & (mask << 32))) handle_ep_out (num);

// Special case for high probability bit.

if (likely(intr & 0x00000001UL)) handle_ep0_in();
if (likely(intr & 0x00010000UL)) handle_ep0_out();

chkintr (0x0002UL,  1);  chkintr (0x0004UL,  2);  chkintr (0x0008UL,  3);
chkintr (0x0010UL,  4);  chkintr (0x0020UL,  5);  chkintr (0x0040UL,  6);
chkintr (0x0080UL,  7);  chkintr (0x0100UL,  8);  chkintr (0x0200UL,  9);
chkintr (0x0400UL, 10);  chkintr (0x0800UL, 11);  chkintr (0x1000UL, 12);
chkintr (0x2000UL, 13);  chkintr (0x4000UL, 14);  chkintr (0x8000UL, 15);

The only step up from there is hand-coding assembly language and there's still the good possibility that gcc may be able to outperform you :-)

paxdiablo
  • 854,327
  • 234
  • 1,573
  • 1,953
  • Well I may me over-worried because it wouldn't really use too many time, but I still don't want to lie to myself :P. Also I think that Linux Kernel by default builds with optimization level 2. – Alvin Wong Sep 13 '12 at 07:36
  • Level 2 may well be enough. Certainly, there are things in the kernel already with pretty stringent timing requirements, so maybe `-O2` is enough. `-O3` would probably make kernel debugging quite a nightmare. Bottom line advice, don't worry about this until you _know_ it's a problem. Both the loop and the unrolled form will probably be more than fast enough. – paxdiablo Sep 13 '12 at 07:39
  • well, lets see if there would be more answers. – Alvin Wong Sep 13 '12 at 07:44