7

NOTE This is a theoretical question. I'm happy with the performance of my actual code as it is. I'm just curious about whether there is an alternative.

Is there a trick to do an integer division of a constant value, which is itself an integer power of two, by an integer variable value, without having to use do an actual divide operation?

// The fixed value of the numerator
#define SIGNAL_PULSE_COUNT 0x4000UL

// The division that could use a neat trick.
uint32_t signalToReferenceRatio(uint32_t referenceCount)
{
    // Promote the numerator to a 64 bit value, shift it left by 32 so
    // the result has an adequate number of bits of precision, and divide
    // by the numerator.
    return (uint32_t)((((uint64_t)SIGNAL_PULSE_COUNT) << 32) / referenceCount);
}

I've found several (lots) of references for tricks to do division by a constant, both integer and floating point. For example, the question What's the fastest way to divide an integer by 3? has a number of good answers including references to other academic and community materials.

Given that the numerator is constant, and it's an integer power of two, is there a neat trick that could be used in place of doing an actual 64 bit division; some kind of bit-wise operation (shifts, AND, XOR, that kind of stuff) or similar?

I don't want any loss of precision (beyond a possible half bit due to integer rounding) greater than that of doing the actual division, as the precision of the instrument relies on the precision of this measurement.

"Let the compiler decide" is not an answer, because I want to know if there is a trick.

Extra, Contextual Information

I'm developing a driver on a 16 bit data, 24 bit instruction word micro-controller. The driver does some magic with the peripheral modules to obtain a pulse count of a reference frequency for a fixed number of pulses of a signal frequency. The required result is a ratio of the signal pulses to the reference pulse, expressed as an unsigned 32 bit value. The arithmetic for the function is defined by the manufacturer of the device for which I'm developing the driver, and the result is processed further to obtain a floating point real-world value, but that's outside the scope of this question.

The micro-controller I'm using has a Digital Signal Processor that has a number of division operations that I could use, and I'm not afraid to do so if necessary. There would be some minor challenges to overcome with this approach, beyond the putting together the assembly instructions to make it work, such as the DSP being used to do a PID function in a BLDC driver ISR, but nothing I can't manage.

Community
  • 1
  • 1
Evil Dog Pie
  • 2,300
  • 2
  • 23
  • 46
  • Even if there is one, I wouldn't use C but assembly. Then you can be sure no optimization will be performed and can program everything just as you want to. – cadaniluk Jan 28 '16 at 13:48
  • 3
    There are no 16 bit ARM cores! And leave optimisation to your compiler. Don't do prmature optimisations. What is the generated Assembler code? And: optimising division, but then using floating point sounds ... inconsistent. – too honest for this site Jan 28 '16 at 13:52
  • 2
    What do you expect this trick to do? What should it give you that normal division does not? – user694733 Jan 28 '16 at 13:57
  • The "trick" is probably to use compile-time constants and then ensure that the function is inlined. The compiler will then be able to do the best optimization from a case-to-case basis. – Lundin Jan 28 '16 at 13:59
  • There is no such trick as far as I know. The compiler probably doesn't know them either. There are many tricks if the divisor is constant, but (apart from trivial cases) not if the numerator is constant. FWIW GCC doesn't seem to know any such trick either. – harold Jan 28 '16 at 14:04
  • @Olaf Thank you. It's 16 bit data, 24 bit instruction word [dsPIC33EPXXX](http://ww1.microchip.com/downloads/en/DeviceDoc/70616g.pdf). I've corrected the question. – Evil Dog Pie Jan 28 '16 at 14:05
  • @MikeofSST: That is still not an **ARM** core! If you still insist, please state the exact name of the core. – too honest for this site Jan 28 '16 at 14:06
  • @cad Assembly may be the way I go, particularly if I do use the DSP. I'll code in C, compile, hand-optimise the intermediate assembly output, and add it to the small-but-sacred-and-unmaintainable-library of assembly routines. – Evil Dog Pie Jan 28 '16 at 14:07
  • @Olaf Thanks again. I've no idea why I thought it was an ARM. At least I've learned something today. :-) – Evil Dog Pie Jan 28 '16 at 14:22
  • @Olaf The floating point is only used at the last moment, because the communication protocol requires that the end value is represented that way. The extra calculation is a polynomial that is evaluated using DSP multiply and accumulate instructions on 32 integers with 64 bit intermediate results. I'm not trying to optimise my code, I'm trying to find out if there is a (probably unmaintainable) alternative. The question is one of interest rather than need. – Evil Dog Pie Jan 28 '16 at 14:31
  • 1
    @user694733 I expect the trick to broaden my mind and make me wonder at the mathematical genius of the people who use SO. In the interest of keeping my code maintainable and my colleagues friendly, I'll almost certainly stick with the division - it's readable and maintainable - unless there is a pure genius answer that is clear, understandable and won't confuse me in two weeks time. My goal is to improve my knowledge, rather than my code. – Evil Dog Pie Jan 28 '16 at 14:37
  • You should regard the lack of answers as a `No`. – Klas Lindbäck Jan 28 '16 at 14:53
  • @KlasLindbäck You could be right. Would 'No' be the shortest correct answer on SO? :-) – Evil Dog Pie Jan 28 '16 at 15:30
  • What is the range of `referenceCount`? The full 32-bit range? – chux - Reinstate Monica Jan 28 '16 at 15:52
  • 1
    Attributes of a "trick" comes down to `1/referenceCount` and composing the fraction scaled by `SIGNAL_PULSE_COUNT`, OP can tolerate a small error, direct `power_of_2/x` is too slow. Suppose `SIGNAL_PULSE_COUNT == 0` is not a concern. Give this post some time. – chux - Reinstate Monica Jan 28 '16 at 16:02
  • @chux In the real world, `referenceCount` will never be bigger then 24 bits. The 'ideal' range is between (approximately) 0x120000 and 0xB40000. The actual range depends on physical, environmental factors such as pressure and temperature, but these will cause the limits of the range to vary by no more than a few hundred. And yes, the 'trick' essentially comes down to finding a reciprocal expressed as an integer, with some pre-defined scaling. – Evil Dog Pie Jan 28 '16 at 16:13
  • @chux Now you've got me thinking about it! Maybe there's a cryptographic method somewhere for doing multiplicative inverses using modulus arithmetic? – Evil Dog Pie Jan 28 '16 at 16:19
  • 2 other ideas: Take advantage of the ratio of the width of the range of `SIGNAL_PULSE_COUNT` is 1:10. 2) Take advantage of previous calculations as, I'm guessing, `SIGNAL_PULSE_COUNT` is not changing too quickly. 3) "lack of answers" does not mean "no answers", Patience young grasshopper. – chux - Reinstate Monica Jan 28 '16 at 16:24
  • FP Ref: http://stackoverflow.com/q/15380564/2410359 – chux - Reinstate Monica Jan 28 '16 at 18:15
  • The closest thing I know to this is http://spiral.ece.cmu.edu/mcm/gen.html. It's not exactly what you want, but maybe you can learn from it. – Jeff Hammond Jan 28 '16 at 20:45
  • some dsp's has intrinsics functions such as _rcpsp of [texas instruments](http://www.ti.com/lit/ug/spru733a/spru733a.pdf) (look for the assembly version RCPSP function). It calculates the 1/x where x is a floating point. Of course it's not what you want :) what i mean that you should check the instruction set of the DSP and look for any instruction that is helpful for you. If you are lucky enough you can even find a C compatible intrinsic version.. – seleciii44 Jan 28 '16 at 22:37
  • @seleciii44 In practice, it will probably be coded as it's written in the question, as it's unlikely to matter much if the calculation takes a few micro seconds. If I really do need to optimise it, I will take your advice and have a more in-depth look at the DSP instruction set, as there may be instructions that I'm not familiar with. (The Microchip compiler does provide some `__builtin_` functions to provide access to DSP instructions from C.) – Evil Dog Pie Jan 29 '16 at 14:08
  • If you can find a "trick", let us know. But the nature of your explanation makes it sound like, even if you do it the "slow" way, it accounts for a very small fraction of execution time. If so, even if you could reduce its cycle count to zero you would not see much improvement. – Mike Dunlavey Jan 29 '16 at 18:24
  • @MikeDunlavey You're correct. I wouldn't see any significant improvement. But I'm not trying to improve my code (which is awesome - ha ha), rather I'm trying to broaden my knowledge using my code as a context; it's a theoretical question. – Evil Dog Pie Jan 29 '16 at 18:31
  • I wrote a fast and inaccurate version for C6400. It depends on a LMBD instruction to detect where the most significant 1 is in an integer. It uses 10 instructions and 1 table look up. That code has an error of a few percent, so I switched to TI's IQMath later, which is much more accurate. – user3528438 Jan 29 '16 at 20:23

4 Answers4

4

You cannot use clever mathematical tricks to not do a division, but you can of course still use programming tricks if you know the range of your reference count:

  • Nothing beats a pre-computed lookup table in terms of speed.
  • There are fast approximate square root algorithms (probably already in your DSP), and you can improve the approximation by one or two Newton-Raphson iterations. If doing the computation with floating-point numbers is accurate enough for you, you can probably beat a 64bit integer division in terms of speed (but not in clarity of code).

You mentioned that the result will be converted to floating-point later, it might be beneficial to not compute the integer division at all, but use your floating point hardware.

planckh
  • 156
  • 1
  • This is correct. It just feels too easy. I have the shape of a 'clever trick' (which is probably wrong) forming in my mind. Something about using a modular multiplicative inverse with the modulus being the nearest prime to `SIGNAL_PULSE_COUNT` and then using the special case of Euler's theorem to arrive at a close approximation ... – Evil Dog Pie Jan 29 '16 at 18:27
  • I meant of course [fast approximate **reciprocal** square root](https://en.wikipedia.org/wiki/Fast_inverse_square_root). You should definitely check that method if you are interested in magic that computes reciprocals. – planckh Feb 01 '16 at 10:05
1

I worked out a Matlab version, using fixed point arithmetic.

This method assumes that a integer version of log2(x) can be calculated efficiently, which is true for dsPIC30/33F and TI C6000 that have instruction to detect the most significant 1 of an integer.

For this reason, this code has strong ISA depency and can not be written in portable/standard C and can be improved using instructions like multiply-and-add, multiply-and-shift, so I won't try translating it to C.

nrdiv.m

function [ y ] = nrdiv( q, x, lut) 
                          % assume q>31, lut = 2^31/[1,1,2,...255]
    p2 = ceil(log2(x));   % available in TI C6000, instruction LMBD
                          % available in Microchip dsPIC30F/33F, instruction FF1L 
    if p2<8
        pre_shift=0;
    else
        pre_shift=p2-8;
    end                                  % shr = (p2-8)>0?(p2-8):0;

    xn = shr(x, pre_shift);              % xn  = x>>pre_shift;
    y  = shr(lut(xn), pre_shift);        % y   = lut[xn]>pre_shift; 
    y  = shr(y * (2^32 - y*x), 30);      % basic iteration
                                         % step up from q31 to q32
    y  = shr(y * (2^33 - y*x), (64-q));  % step up from q32 to desired q
    if q>39
        y = shr(y * (2^(1+q) - y*x), (q));  % when q>40, additional 
                                            % iteration is required, 
    end                                     % no step up is performed
end
function y = shr(x, r)
    y=floor(x./2^r);             % simulate operator >>
end

test.m

test_number = (2^22-12345);
test_q      = 48;

lut_q31 = round(2^31 ./ [1,[1:1:255]]);
display(sprintf('tested 2^%d/%d, diff=%f\n',test_q, test_number,...
                 nrdiv( 39, (2^22-5), lut_q31) - 2^39/(2^22-5)));

sample output

tested 2^48/4181959, diff=-0.156250

reference:

Newton–Raphson division

user3528438
  • 2,737
  • 2
  • 23
  • 42
1

A little late but here is my solution.

First some assumptions:

Problem:

X=N/D where N is a constant ans a power of 2.

All 32 bit unsigned integers.

X is unknown but we have a good estimate (previous but no longer accurate solution).

An exact solution is not required.

Note: due to integer truncation this is not an accurate algorithm!

An iterative solution is okay (improves with each loop).

Division is much more expensive than multiplication:

For 32bit unsigned integer for Arduino UNO:

'+/-' ~0.75us

'*' ~3.5us

'/' ~36us 4 We seek to replace the Basically lets start with Newton's method:

Xnew=Xold-f(x)/(f`(x)

where f(x)=0 for the solution we seek.

Solving this I get:

Xnew=XNew*(C-X*D)/N

where C=2*N

First trick:

Now that the Numerator (constant) is now a Divisor (constant) then one solution here (which does not require the N to be a power of 2) is:

Xnew=XNew*(C-X*D)*A>>M

where C=2*N, A and M are constants (look for dividing by a constant tricks).

or (staying with Newtons method):

Xnew=XNew*(C-X*D)>>M

where C=2>>M where M is the power.

So I have 2 '*' (7.0us), a '-' (0.75us) and a '>>' (0.75us?) or 8.5us total (rather than 36us), excluding other overheads.

Limitations:

As the data type is 32 bit unsigned, 'M' should not exceed 15 else there will be problems with overflow (you can probably get around this using a 64bit intermediate data type).

N>D (else the algorithm blows up! at least with unsigned integer)

Obviously the algorithm will work with signed and float data types)

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
int main(void)
{
  unsigned long c,d,m,x;
  // x=n/d where n=1<<m
  m=15;
  c=2<<m;
  d=10;
  x=10;
  while (true)
  {
    x=x*(c-d*x)>>m;
    printf("%ld",x);
    getchar();
  }
  return(0);
}
Evil Dog Pie
  • 2,300
  • 2
  • 23
  • 46
AlanC
  • 11
  • 1
0

Having tried many alternatives, I ended up doing normal binary long division in assembly language. However, the routine does use a few optimisations that bring the execution time down to an acceptable level.

/*
 * Converts the reference frequency count for a specific signal frequency
 * to a ratio.
 *   Xs = Ns * 2^32 / Nr
 *   Where:
 *   2^32 is a constant scaling so that the maximum accuracy can be achieved.
 *   Ns is the number of signal counts (fixed at 0x4000 by hardware).
 *   Nr is the number of reference counts, passed in W1:W0.
 * @param  W1:W0    The number of reference frequency pulses.
 * @return W1:W0    The scaled ratio.
 */
    .align  2
    .global _signalToReferenceRatio
    .type   _signalToReferenceRatio, @function

    ; This is the position of the most significant bit of the fixed Ns (0x4000).
    .equ    LOG2_DIVIDEND,  14
    .equ    DIVISOR_LIMIT,  LOG2_DIVIDEND+1
    .equ    WORD_SIZE,      16

_signalToReferenceRatio:
    ; Create a dividend, MSB-aligned with the divisor, in W2:W3 and place the
    ; number of iterations required for the MSW in [W14] and the LSW in [W14+2].
    LNK     #4
    MUL.UU  W2, #0, W2
    FF1L    W1, W4
    ; If MSW is zero the argument is out of range.
    BRA     C, .returnZero
    SUBR    W4, #WORD_SIZE, W4
    ; Find the number of quotient MSW loops.
    ; This is effectively 1 + log2(dividend) - log2(divisor).
    SUBR    W4, #DIVISOR_LIMIT, [W14]
    BRA     NC, .returnZero
    ; Since the SUBR above is always non-negative and the C flag set, use this
    ; to set bit W3<W5> and the dividend in W2:W3 = 2^(16+W5) = 2^log2(divisor).
    BSW.C   W3, W4
    ; Use 16 quotient LSW loops.
    MOV     #WORD_SIZE, W4
    MOV     W4, [W14+2]

    ; Set up W4:W5 to hold the divisor and W0:W1 to hold the result.
    MOV.D   W0, W4
    MUL.UU  W0, #0, W0

.checkLoopCount:
    ; While the bit count is non-negative ...
    DEC     [W14], [W14]
    BRA     NC,  .nextWord

.alignQuotient:
    ; Shift the current quotient word up by one bit.
    SL      W0, W0
    ; Subtract divisor from the current dividend part.
    SUB     W2, W4, W6
    SUBB    W3, W5, W7
    ; Check if the dividend part was less than the divisor.
    BRA     NC, .didNotDivide
    ; It did divide, so set the LSB of the quotient.
    BSET    W0, #0
    ; Shift the remainder up by one bit, with the next zero in the LSB.
    SL      W7, W3
    BTSC    W6, #15
    BSET    W3, #0
    SL      W6, W2
    BRA     .checkLoopCount
.didNotDivide:
    ; Shift the next (zero) bit of the dividend into the LSB of the remainder.
    SL      W3, W3
    BTSC    W2, #15
    BSET    W3, #0
    SL      W2, W2
    BRA     .checkLoopCount

.nextWord:
    ; Test if there are any LSW bits left to calculate.
    MOV     [++W14], W6
    SUB     W6, #WORD_SIZE, [W14--]
    BRA     NC, .returnQ
    ; Decrement the remaining bit counter before writing it back.
    DEC     W6, [W14]
    ; Move the working part of the quotient up into the MSW of the result.
    MOV     W0, W1
    BRA     .alignQuotient

.returnQ:
    ; Return the quotient in W0:W1.
    ULNK
    RETURN

.returnZero:
    MUL.UU  W0, #0, W0
    ULNK
    RETURN
.size   _signalToReferenceRatio, .-_signalToReferenceRatio
Evil Dog Pie
  • 2,300
  • 2
  • 23
  • 46