14

I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second, and I figured that removing it as much as possible would significantly speed up my code; as of now, one cycle of my main loop does not run in 1/60 of a second as it should. I was wondering if it is possible to re-implement the modulo using only bit shifts, as is possible with multiplication and division. So here is my code so far in C++ (if I can perform a modulo using assembly it would be even better). How can I remove the modulo without using division or multiplication?

    while(input > 0)
    {
        out = (out << 3) + (out << 1);
        out += input % 10;

        input = (input >> 8) + (input >> 1);
    }

EDIT: Actually I realized that I need to do it way more than 180 times per second, seeing as the value of input can be a very large number, up to 40 digits.

PgrAm
    180 times/second... on what hardware? That's nothing on a modern non-embedded processor. – Mysticial Jun 18 '12 at 02:02
    On a 16-bit processor. I know it's nothing, but there's a lot of other code that needs to finish in 1/60 of a second, and the modulo needs to happen three times for every cycle of the main loop. I want to squeeze out as much speed as I can. – PgrAm Jun 18 '12 at 02:05
  • Does the modulus satisfy any sort of property? Are you using the same modulus many times. If neither is the case, I doubt you can do any better than the hardware division instruction. – Mysticial Jun 18 '12 at 02:08
  • @Shawn B Turbo C++ 3.0. I know it's old, but I need 286 support. – PgrAm Jun 18 '12 at 02:19
  • No need for the "but", it's what you need for the project. It might be worth checking out the assembly output of the code if you can; it would give more definitive information about what is happening. – Shawn Buckley Jun 18 '12 at 02:23
  • @PgrAm Does your chip have native divide instruction and is it 32bit? Then the instruction returns a remainder even if C doesn't compile for it. Otherwise there is no universally fast modulo algorithm. Best to code around it to remove modulo altogether. Else restrict the range and use a combination of shifts and table lookups. – starbolin Jun 18 '12 at 02:28
  • @PgrAm The div() function in your math library may have a better code than your compiler produces. – starbolin Jun 18 '12 at 02:32
    @PgrAm : "*I need 286 support*" What? Why? What planet do you live on? – ildjarn Jun 18 '12 at 03:23
    40 digits? a 64-bit number is only 19.1 digits. how can your number be 40 digits? – std''OrgnlDave Jun 18 '12 at 03:29
  • Is this for a demoscene product? ;) (Those who don't know, computer subculture focused around audiovisual presentations called demos and intros, so yes, many people program for ancient 80's machines with no optimizing compilers whatsoever. Every cycle and byte counts.) – zxcdw Jun 18 '12 at 05:14
  • @zxcdw you have the general idea – PgrAm Jun 18 '12 at 15:33
  • What compiler are you using? Division by constants is a well known optimization that gcc is doing and since the correlation between the remainder and quotient is obvious you should actually get by with 2 muls and some adds/shifts. – Voo Jun 18 '12 at 20:43

5 Answers

24

What you can do with simple bitwise operations is take a power-of-two modulo (divisor) of the value (dividend) by ANDing it with divisor - 1. A few examples:

unsigned int val = 123; // initial value
unsigned int rem;

rem = val & 0x3; // remainder after value is divided by 4.
                 // Equivalent to 'val % 4'
rem = val % 5;   // remainder after value is divided by 5.
                 // Because 5 isn't a power of two, we can't simply AND with 5-1 (=4).

Why does it work? Let's consider the bit pattern for the value 123, which is 1111011, and then the divisor 4, which has the bit pattern 00000100. As we know by now, the divisor has to be a power of two (as 4 is), and we need to decrement it by one (from 4 to 3 in decimal), which yields the bit pattern 00000011. After we bitwise-AND the original 123 with 3, the resulting bit pattern is 00000011, which is 3 in decimal. The reason we need a power-of-two divisor is that once we decrement it by one, all the less significant bits are set to 1 and the rest are 0. The bitwise-AND then 'cancels out' the more significant bits of the original value and leaves us with simply the remainder of the original value divided by the divisor.

However, applying something like this to arbitrary divisors is not going to work unless you know your divisors beforehand (at compile time, and even then it requires divisor-specific code paths); resolving it at run time is not feasible, especially in your case where performance matters.

Also there's a previous question related to the subject which probably has interesting information on the matter from different points of view.

zxcdw
    I had a similar question as to why only "(Power of 2) - 1" works with modulo. Thank you for the explanation! – whitehat Oct 17 '15 at 18:37
4

Actually, division by constants is a well-known compiler optimization, and in fact gcc is already doing it.

This simple code snippet:

int mod(int val) {
   return val % 10;
}

Generates the following code on my rather old gcc with -O3:

_mod:
        push    ebp
        mov     edx, 1717986919
        mov     ebp, esp
        mov     ecx, DWORD PTR [ebp+8]
        pop     ebp
        mov     eax, ecx
        imul    edx
        mov     eax, ecx
        sar     eax, 31
        sar     edx, 2
        sub     edx, eax
        lea     eax, [edx+edx*4]
        mov     edx, ecx
        add     eax, eax
        sub     edx, eax
        mov     eax, edx
        ret

If you disregard the function prologue/epilogue, that's basically two muls (on x86 we're lucky and can use lea for one) plus some shifts and adds/subs. I know I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.

Now on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster for your obviously somewhat more ancient CPU is a question that can only be answered by benchmarking (also make sure your compiler is doing that optimization, otherwise you can always just "steal" the gcc version here ;) ). It especially depends on an efficient mulhs (i.e. the higher bits of a multiply instruction) to be efficient. Note that this code is not size-independent; to be exact, the magic number changes with the operand width (and maybe also parts of the adds/shifts), but that can be adapted.

Voo
2

Doing modulo 10 with bit shifts is going to be hard and ugly, since bit shifts are inherently binary (on any machine you're going to be running on today). If you think about it, bit shifts are simply multiplication or division by powers of 2.

But there's an obvious space-time trade you could make here: set up a table mapping input to input % 10 and look the remainder up. Then the line becomes

    out += tab[input];

and with any luck at all, that will turn out to be one 16-bit add and a store operation.

Charlie Martin
    I don't care about difficulty or ugliness only speed. However a table would waste too much of my memory seeing as the table would have to be 40^10 elements in size. – PgrAm Jun 18 '12 at 02:18
  • You want to think that one out again. – Charlie Martin Jun 18 '12 at 03:04
    You can break it into two bytes since modulus is distributive over addition. Them you need a table of only 512 entries for a 16-bit integer. – Raymond Chen Jun 18 '12 at 03:18
  • Since 10 is divisible by 2, you only need 128 entries for the LSB. After that, it's still efficient to break it into any number of smaller pieces, but at some point the computation will be more than the division-multiplication-subtraction algorithm. Note that it's distributive, but converting the sum back to a modulus requires a second modulus operation, so the algo becomes recursive. – Potatoswatter Jun 18 '12 at 06:00
1

If you want to do modulo 10 with shifts, maybe you can adapt the double dabble algorithm to your needs?

This algorithm is used to convert binary numbers to decimal without using modulo or division.

Rafał Rawicki
1

Every power of 16 ends in 6. If you represent the number as a sum of powers of 16 (i.e. break it into nybbles), then every term except the one's place contributes to the last digit in the same way.

0x481A % 10 = ( 0x4 * 6 + 0x8 * 6 + 0x1 * 6 + 0xA ) % 10

Note that 6 = 5 + 1, and the 5's will cancel out if there are an even number of them. So just sum the nybbles (except the last one) and add 5 if the result is odd.

0x481A % 10 = ( 0x4 + 0x8 + 0x1 /* sum = 13 */
                + 5 /* so add 5 */ + 0xA /* and the one's place */ ) % 10
            = 28 % 10

This reduces the 16-bit, 4-nybble modulo to a number at most 0xF * 4 + 5 = 65. In binary, that is annoyingly still 2 nybbles, so you would need to repeat the algorithm (although one of them doesn't really count).

But the 286 should have reasonably efficient BCD addition that you can use to perform the sum and obtain the result in one pass. (That requires converting each nybble to BCD manually; I don't know enough about the platform to say how to optimize that or whether it's problematic.)

Potatoswatter
    [DAA - Decimal Adjust for Addition](http://www.penguin.cz/~literakl/intel/d.html) et al. should come in handy – sehe Jun 18 '12 at 10:51
  • Hmm, the 286 has [22 cycle](http://umcs.maine.edu/~cmeadow/courses/cos335/80x86-Integer-Instruction-Set-Clocks.pdf) 16-bit division. That's gonna be hard to beat this way, especially with no barrel shifter (!). Maybe this is still helpful, depending what OP means by "40 digits." Likewise, not clear how 180 times per second would be a problem in the first place. – Potatoswatter Jun 18 '12 at 18:39