Increasing performance of 32bit math on 16bit processor

Question

I am working on some firmware for an embedded device that uses a 16 bit PIC operating at 40 MIPS and programming in C. The system will control the position of two stepper motors and maintain the step position of each motor at all times. The max position of each motor is around 125000 steps so I cannot use a 16bit integer to keep track of the position. I must use a 32 bit unsigned integer (DWORD). The motor moves at 1000 steps per second and I have designed the firmware so that steps are processed in a Timer ISR. The timer ISR does the following:

1) compare the current position of one motor to the target position, if they are the same set the isMoving flag false and return. If they are different set the isMoving flag true.

2) If the target position is larger than the current position, move one step forward, then increment the current position.

3) If the target position is smaller than the current position, move one step backward, then decrement the current position.

Here is the code:

void _ISR _NOPSV _T4Interrupt(void)
{
    static char StepperIndex1 = 'A';    

    if(Device1.statusStr.CurrentPosition == Device1.statusStr.TargetPosition)
    {
        Device1.statusStr.IsMoving = 0;
        // Do Nothing
    }   
    else if (Device1.statusStr.CurrentPosition > Device1.statusStr.TargetPosition)
    {
        switch (StepperIndex1)      // MOVE OUT
        {
            case 'A':
                SetMotor1PosB();
                StepperIndex1 = 'B';
                break;
            case 'B':
                SetMotor1PosC();
                StepperIndex1 = 'C';
                break;
            case 'C':
                SetMotor1PosD();
                StepperIndex1 = 'D';
                break;
            case 'D':
                default:
                SetMotor1PosA();
                StepperIndex1 = 'A';
                break;      
        }
        Device1.statusStr.CurrentPosition--;    
        Device1.statusStr.IsMoving = 1;
    }   
    else
    {
        switch (StepperIndex1)      // MOVE IN 
        {
            case 'A':
                SetMotor1PosD();
                StepperIndex1 = 'D';
                break;
            case 'B':
                SetMotor1PosA();
                StepperIndex1 = 'A';
                break;
            case 'C':
                SetMotor1PosB();
                StepperIndex1 = 'B';
                break;
            case 'D':
                default:
                SetMotor1PosC();
                StepperIndex1 = 'C';
                break;      
        }
        Device1.statusStr.CurrentPosition++;
        Device1.statusStr.IsMoving = 1;
    }   
    _T4IF = 0;          // Clear the Timer 4 Interrupt Flag.
}

The target position is set in the main program loop when move requests are received. The SetMotorPos lines are just macros to turn on/off specific port pins.

My question is: Is there any way to improve the efficiency of this code? The code functions fine as is if the positions are 16bit integers but as 32bit integers there is too much processing. This device must communicate with a PC without hesitation and as written there is a noticeable performance hit. I really only need 18 bit math but I don't know of an easy way of doing that! Any constructive input/suggestions would be most appreciated.

What does the compiler generate? Which optimization flags are you using? — nmichaels, Nov 04 '11 at 20:51
As long as you're only adding and subtracting 32-bit values, not multiplying or dividing (especially not dividing), there should be very little impact on performance. Look elsewhere for your delays, they aren't caused by 32-bit addition/increments. — Ben Voigt, Nov 04 '11 at 20:58
@Ben Voigt: What would be the difference in instructions to subtract 32bit numbers vs 16 bit numbers? Would it be twice as many? — PICyourBrain, Nov 04 '11 at 21:08
Depends on whether the instruction set offers an 'add with carry' and a corresponding 'subtract with borrow' instruction or not. If it does, it's just an add and add-with-carry or a sub and a sub-with-borrow, respectively (2 instructions). If not, the borrow must be simulated, which will be a bit more. The other thing is that the compiler might be doing an awful job. Even so, it seems more likely that your switch block will be more expensive than the increment/decrement. — pmdj, Nov 04 '11 at 21:35
What kind of assembly is the compiler producing for those 32-bit math ops? — bdonlan, Nov 04 '11 at 21:56
@JordanS: Probably around three times as many. So 5% of your code takes 3x as long, that'd be a 10% overall slowdown. If you're trying to get an order of magnitude performance increase, you need to look elsewhere. — Ben Voigt, Nov 04 '11 at 22:22
@JordanS: that depends what your `SetMotor1Pos*()` functions do. I suspect you could condense them into one function which accepts a number between 0 and 3 - they probably just write that number into a memory-mapped device register or so. Then you can use `CurrentPosition % 4` to calculate that parameter after incrementing/decrementing `CurrentPosition`. This would prevent the need for the `switch` completely. Either way, I'd verify that this interrupt handler is your bottleneck first. — pmdj, Nov 05 '11 at 11:54

score 6 · Accepted Answer · answered Nov 04 '11 at 22:12

Warning: all numbers are made up...

Supposing that the above ISR has about 200 (likely, fewer) instructions of compiled code and those include the instructions to save/restore the CPU registers before and after the ISR, each taking 5 clock cycles (likely, 1 to 3) and you call 2 of them 1000 times a second each, we end up with 2*1000*200*5 = 2 millions of clock cycles per second or 2 MIPS.

Do you actually consume the rest 38 MIPS elsewhere?

The only thing that may be important here and I can't see it, is what's done inside of the SetMotor*Pos*() functions. Do they do any complex calculations? Do they perform some slow communication with the motors, e.g. wait for them to respond to the commands sent to them?

At any rate, it's doubtful that such simple code would be noticeably slower when working with 32-bit integers than with 16-bit.

If your code is slow, find out where time is spent and how much, profile it. Generate a square pulse signal in the ISR (going to 1 when the ISR starts, going to 0 when the ISR is about to return) and measure its duration with an oscilloscope. Or do whatever is easier to find it out. Measure the time spent in all parts of the program, then optimize where really necessary, not where you have previously thought it would be.

+1 for using an oscilloscope as a profiler, reminds me of good ol' university night hacks. — Albin Sunnanbo, Nov 05 '11 at 06:51

score 1 · Answer 2 · answered Nov 04 '11 at 22:48

1

The difference between 16 and 32 bits arithmetic shouldn't be that big, I think, since you use only increment and comparision. But maybe the problem is that each 32-bit arithmetic operation implies a function call (if the compiler isn't able/willing to do inlining of simpler operations).

One suggestion would be to do the arithmetic yourself, by breaking the Device1.statusStr.CurrentPosition in two, say, Device1.statusStr.CurrentPositionH and Device1.statusStr.CurrentPositionL. Then use some macros to do the operations, like:

#define INC(xH,xL) {xL++;if (xL == 0) xH++;}

answered Nov 04 '11 at 22:48

Fabio Ceconello

15,819
5
38
51

Enumating 32-bit adds on 16-bit hardware is cheap (and a lot cheaper than jumps). If you're worried about performance then multiplications, divisions and floating point are the things to look for, since these are much more expensive to implement in software. – SecurityMatt May 10 '12 at 08:51
The problem I'm referring is not the time to do the math, is the time for the function call. I'd recommend doing that ONLY if the compiled won't do inlining. – Fabio Ceconello May 10 '12 at 13:18
Technically it depends on the compiler and the hardware, but almost nobody would do a 32-bit integer add on 16-bit hardware by doing a function call. On 16-bit x86 it would be done by an ADD followed immediately by an ADC (add with carry). – SecurityMatt May 11 '12 at 03:50
Notice that the question is about embedded devices. In environments with constrained memory it's more common to have function calls for a math lib. – Fabio Ceconello May 11 '12 at 14:00
That doesn't change anything. A function call on 16-bit devices takes at least 3 bytes to encode, whereas a 32-bit add can be done in 2 bytes. Consequently a function call is slower _and_ bigger. – SecurityMatt May 11 '12 at 20:43

score 0 · Answer 3 · answered Nov 04 '11 at 23:09

I would get rid of the StepperIndex1 variable and instead use the two low-order bits of CurrentPosition to keep track of the current step index. Alternately, keep track of the current position in full rotations (rather than each step), so it can fit in a 16 bit variable. When moving, you only increment/decrement the position when moving to phase 'A'. Of course, this means you can only target each full rotation, rather than every step.

GJ. · Answer 4 · 2011-11-05T10:42:08.920

Sorry, but you are using bad program design.

Let's check the difference between 16 bit and 32 bit PIC24 or PIC33 asm code...

16 bit increment

inc    PosInt16               ;one cycle

So 16 bit increment takes one cycle

32bit increment

clr    Wd                     ;one cycle
inc    low PosInt32           ;one cycle
addc   high PosInt32, Wd      ;one cycle

and 32 increment takes three cycles. The total difference is 2 cycles or 50ns (nano seconds).

Simple calcolation will show you all. You have 1000 steps per second and 40Mips DSP so you have 40000 instructions per step at 1000 steps per second. More than enough!

Now compare speed of multiplication. – SecurityMatt May 11 '12 at 03:51 — SecurityMatt, May 11 '12 at 03:51

score 0 · Answer 5 · answered Nov 07 '11 at 12:46

When you change it from 16bit to 32bit do you change any of the compile flags to tell it to compile as a 32bit application instead.

have you tried compiling with the 32bit extensions but using only 16bit integers. do you still get such a performance drop?

It's likely that just by changing from 16bit to 32bit that some operations are compiled differently, perhaps do a Diff between the two sets of compiled ASM code and see what is actually different, is it lots or is it only a couple of lines.

Solutions would be maybe instead of using a 32bit integer, just use two 16bit integers, when the valueA is int16.Max then set it to 0 and then increment valueB by 1 otherwise just incriment ValueA by 1, when value B is >= 3 you then check valueA >= 26696 (or something similar depending if you use unsigned or signed int16) and then you have your motor checking at 12500.

Increasing performance of 32bit math on 16bit processor

5 Answers5