2

I am looking for a fast modulo-10 algorithm because I need to speed up my program, which does many modulo operations in loops.

I have checked out this page, which compares some alternatives. If I understand it correctly, T3 was the fastest of them all. My question is: what would x % y look like using the T3 technique?

I copied the T3 technique here for simplicity, in case the link goes down.

for (int x = 0; x < max; x++)
{
    if (y > (threshold - 1))
    {
        y = 0; // reset
        total += x;
    }
    y += 1;
}

Regarding the comments: if this is not really faster than regular mod, I am looking for a modulo at least 2 times faster than using %. I have seen many examples that use powers of two, but since 10 is not a power of two, how can I get it to work?

Edit:

For my program, let's say I have two nested for loops, where n=1 000 000 and m=1000.

Looks like this:

for (i = 1; i <= n; i++) {
        D[(i%10)*m] = i;
        for (j = 1; j <= m; j++) {
           ...
        }
}
mrRobot
  • 301
  • 3
  • 12
  • 4
    Do you really believe it will be faster than `x % y` ? – Eugene Sh. Apr 27 '18 at 16:02
  • 4
    First check: Has your compiler writer perhaps also read this and already implemented an optimization for `x % 10`? – Bo Persson Apr 27 '18 at 16:03
  • 2
    Have you measured and benchmarked and profiled this is indeed a bottle-neck in your program? Have you checked the (optimized) generated code? Perhaps your problem is less of a modulo problem and more of a cache problem? – Some programmer dude Apr 27 '18 at 16:05
  • You may find this article on optimizing through avoiding use of modulus interesting: https://embeddedgurus.com/stack-overflow/2011/02/efficient-c-tip-13-use-the-modulus-operator-with-caution/ – Christian Gibbons Apr 27 '18 at 16:08
  • @BoPersson it is not. I have this confirmed. – mrRobot Apr 27 '18 at 16:09
  • 1
@mrRobot Note that a compiler must assume `x,y` use the entire range of `int` (except `y==0`). To make something _faster_, knowing `y` is 10 **and** that `x` uses a subset of `int`, faster code is often possible. If so, post the restricted values of `x,y` for your situation. – chux - Reinstate Monica Apr 27 '18 at 16:12
  • 3
    @mrRobot Consider optimizing the loop and not just the `%` calculation. – chux - Reinstate Monica Apr 27 '18 at 16:16
  • I have updated my question with particular code. – mrRobot Apr 27 '18 at 16:20
  • 3
How about breaking up the outer loop? `for (ii = 1; ii <= n; ii += 10) { for (i=0; i< 10; i++) { D[i*m] = ii + i;...` or some variation and skip the use of `%`? Is that allowable? – chux - Reinstate Monica Apr 27 '18 at 16:23
  • 1
    For your example values of n=1000000 and m=1000 the inner loop (j) is being performed 1 billion times. Maybe something in that loop could be sped up. – Bob Jarvis - Слава Україні Apr 27 '18 at 16:28
  • @BobJarvis only `%` operations in there which could be faster with what I am asking for. Nothing else could be. Maybe the use of the `for loop` which @chux mentioned. – mrRobot Apr 27 '18 at 16:31
  • I wouldn't call that "T3" a technique, when one can evaluate the sum of all x%n == 0 without a loop. – Aki Suihkonen Apr 27 '18 at 16:42
  • @BobJarvis To be fair, the rest of the loop body consists of integer comparisons and additions. It’s likely not the bottleneck (although it does perform branching). – Konrad Rudolph Apr 27 '18 at 16:44
  • 1
    @KonradRudolph - last year one of my coworkers had a performance problem. An important program was running for hours, but (per my coworker's explanation) it was just a simple query. I was handed the 10 line query and told "fix it". I looked at the plan and tried alternatives but I just *couldn't* get the query optimizer to do any better. Finally, in frustration, I ran it - and found it returned results in < 10 seconds. Hmmm.... I searched the code base and found...yes, the *basic* query was 10 lines - but the *complete* query was 400+ lines of piled-on SQL. Moral: SHOW ME THE *COMPLETE* CODE! – Bob Jarvis - Слава Україні Apr 28 '18 at 18:16

5 Answers

13

Here's the fastest modulo-10 function you can write:

unsigned mod10(unsigned x)
{
    return x % 10;
}

And here's what it looks like once compiled:

movsxd rax, edi
imul rcx, rax, 1717986919
mov rdx, rcx
shr rdx, 63
sar rcx, 34
add ecx, edx
add ecx, ecx
lea ecx, [rcx + 4*rcx]
sub eax, ecx
ret

Note the lack of division/modulus instructions, the mysterious constants, the use of an instruction which was originally intended for complex array indexing, etc. Needless to say, the compiler knows a lot of tricks to make your program as fast as possible. You'll rarely beat it on tasks like this.

Sneftel
  • 40,271
  • 12
  • 71
  • 104
  • 1
    I was about to write a comment saying "...I'd be surprised if a DIV instruction doesn't figure prominently in the computation of remainder". Well - I'm surprised that a DIV instruction doesn't figure prominently in the computation of the remainder. :-) – Bob Jarvis - Слава Україні Apr 27 '18 at 16:48
  • @BobJarvis Not the DIV instruction itself, but looking closely at that assembly, it's built from a div-by-constant, a mul-by-constant, and a subtraction. – Sneftel Apr 27 '18 at 16:50
That might be (if the compiler is optimal) the fastest code one could write for `x % 10` in general for `unsigned x`. However, given constraints for a specific program, optimizations may be possible. We should not assert this is the fastest possible solution without qualification. – Eric Postpischil Apr 27 '18 at 17:06
  • The constant 1717986919 is 0x66666667. So, the number of the beast, expressed in base 16, extended to 32 bits by duplicating the top-most hex digit, plus one. Clearly a sign of the upcoming [End Of The World](https://www.youtube.com/watch?v=-hJQ18S6aag)! – Bob Jarvis - Слава Україні Apr 27 '18 at 17:10
  • @EricPostpischil Fair enough. Particularly in the case of wrapping in a loop, I would expect `if(i >= 10) i = 0;` to be superior unless the optimizer was *exceptionally* clever. – Sneftel Apr 27 '18 at 19:05
1

You likely can't beat the compiler.

Debug build

//     int foo = x % 10;
010341C5  mov         eax,dword ptr [x]  
010341C8  cdq  
010341C9  mov         ecx,0Ah  
010341CE  idiv        eax,ecx  
010341D0  mov         dword ptr [foo],edx  

Retail build (doing some ninja math there...)

//    int foo = x % 10;
00BD100E  mov         eax,66666667h  
00BD1013  imul        esi  
00BD1015  sar         edx,2  
00BD1018  mov         ecx,edx  
00BD101A  shr         ecx,1Fh  
00BD101D  add         ecx,edx  
00BD101F  lea         eax,[ecx+ecx*4]  
00BD1022  add         eax,eax  
00BD1024  sub         esi,eax
selbie
  • 100,020
  • 15
  • 103
  • 173
  • 2
    Well... theoretically, time-wise `idiv` might take longer than a bunch of other operations. – Eugene Sh. Apr 27 '18 at 16:17
  • 1
    This answer is incorrect in its details: division and modulus by a constant are NOT implemented with `div`/`idiv`. (But it is still best to leave the optimization to the compiler.) – Sneftel Apr 27 '18 at 16:18
  • Updated to show the optimizations for a retail build. – selbie Apr 27 '18 at 16:25
0

The code isn’t a direct substitute for modulo; it substitutes for modulo in that particular situation. You can write your own mod by analogy (for a, b > 0):

int mod(int a, int b) {
    while (a >= b) a -= b;
    return a;
}

… but whether that’s faster than % is highly questionable.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
Well, to ask a more specific question, I am looking for an at least 2x faster mod algorithm. – mrRobot Apr 27 '18 at 16:05
  • 2
    @mrRobot Before trying to optimise builtin arithmetic operations, you should try to optimise pretty much everything else. It’s not impossible that `%` is indeed the bottleneck in your code but it’s more likely that other optimisations lead to more substantial improvements (and are easier to perform). – Konrad Rudolph Apr 27 '18 at 16:06
  • 3
    I will bet money on that you can't beat the modern optimizer. Not in this. – Eugene Sh. Apr 27 '18 at 16:06
Glad you pointed that out, but I can confirm that using, for example, a power of two is at least 2 times faster than the `%` method. But I don't know how to get `% 10` using a power-of-two algo. – mrRobot Apr 27 '18 at 16:08
  • @mrRobot 2x faster for the mod operation, sure. But does that translate into (anything near) 2x for the overall algorithm? *This* is the part we’re doubtful about. – Konrad Rudolph Apr 27 '18 at 16:10
  • @mrRobot In addition, I doubt your claim that you can optimise the operation `% 2` on a modern C compiler. Because the compiler obviously knows that trick too, and it’s a trivial optimisation that any modern compiler will perform. – Konrad Rudolph Apr 27 '18 at 16:12
  • @KonradRudolph yes of course. I am calculating edit distance algo and all of the operations run in 2 for cycles, where in total, `%` calculations run about 1 mil times. Basically `%` takes about 80% of the entire runtime of my program. – mrRobot Apr 27 '18 at 16:13
@KonradRudolph Looks like comparing apples to oranges here. Obviously `x & 1` will be faster than `x % y` when `y` turns out to be `2` at runtime. – Eugene Sh. Apr 27 '18 at 16:14
  • @mrRobot I’m intimately familiar with the [edit distance algorithm](https://en.wikipedia.org/wiki/Edit_distance), but it contains no mod operation. – Konrad Rudolph Apr 27 '18 at 16:15
@EugeneSh. Note: with `int x`, `x&1` is functionally different from `x%2` should negative `x` occur. – chux - Reinstate Monica Apr 27 '18 at 16:15
  • @KonradRudolph yes it does not, unless you try to optimize it by saving memory. Then you can use `%` to use only `n` rows of the table. That's what I do. – mrRobot Apr 27 '18 at 16:21
  • @mrRobot Odd. I do know several linear space algorithms to compute the edit distance, but they don’t use mod either. – Konrad Rudolph Apr 27 '18 at 16:42
  • @KonradRudolph could you provide me some source how to do it? – mrRobot Apr 27 '18 at 16:44
  • @mrRobot Here: https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows — but you don’t even need two matrix rows, a 1D vector is enough; simply treat all values after the current index as coming from the row to the left, and everything before the current index as the current row. You need to store one additional element in the vector to allow for the diagonally top-left ancestor cell. – Konrad Rudolph Apr 27 '18 at 16:48
  • If `f(unsigned x)` and `g(unsigned x)` are both supposed to compute `x%10`, but `f(x)` will usually be invoked when `x` is 9 or less, sometimes invoked when `x` is 10 to 19, rarely when `x` is 20 to 29, and never when `x` is greater, while `g(x)` will usually be invoked when `x` is at least 1,000,000, there's no way a compiler that doesn't know the expected distributions of `x` could generate optimal code for both. – supercat Apr 10 '23 at 16:49
  • @supercat I don’t know enough about how the corresponding CPU intrinsics work on modern CPUs to comment on that statement so I defer to your knowledge (I’ll note that my answer didn’t mean to imply otherwise: I was talking about the general case). But I’m curious: are you saying that a branch + subtractions would beat the intrinsic mod operation? For what it’s worth I wrote an extremely simple (and probably wrong) benchmark [which seems to show the opposite](https://quick-bench.com/q/PmXOP9xG2YieJOP_lvStmifvVGk). – Konrad Rudolph Apr 10 '23 at 19:17
  • @KonradRudolph: I would expect that on most CPUs a compare and branch would be faster than the intrinsic mod operation in cases where the value was already in the 0..9 range. The compare-and-branch approach might be slightly slower than a shift-based mod operation for values in the range 20..29, but if those values are much less common than values 0..9, the performance benefit from handling the smaller values faster could outweigh the occasional penalty for the rare values. If values are all going to be much larger than 10, however, any time spent on comparisons with 10... – supercat Apr 10 '23 at 19:38
  • ...would be wasted. Even though something like `if (x < 10) return x; else return x % 10;` would probably not be much worse than `return x % 10;` even if `x` was never less than 10, the extra comparison would be purely wasteful if `x` was always greater than 10. – supercat Apr 10 '23 at 19:39
0

I came across this discussion, and for `uint64_t` the best way to perform a mod-10 operation is indeed to leave it to the compiler, at least on my standard laptop. However, for `uint128_t` on my recent Ubuntu Linux box I get, for the routine:

for (int i = 0; i < 1000000000; i++)
{
  uint128_t x = n + i;
  s += x % 10;
}

The timing:

   Executed in   21,74 secs   fish           external 
   usr time      21,73 secs  420,00 micros   21,73 secs 
   sys time       0,00 secs  237,00 micros    0,00 secs 

This is very different from the result I got using `uint64_t` instead, so one could expect to do something clever here (and I bet future versions of gcc will implement some form of the following trick). We can take advantage of the rules:

(a+b) mod 10   = (a mod 10 + b mod 10) mod 10

And

(a*b) mod 10 = ((a mod 10)*(b mod 10)) mod 10

To produce the code,

for (int i = 0; i < 1000000000; i++)
{
  uint128_t x = n + i;
  uint64_t  a = (uint64_t)(x >> 64);
  uint64_t  b = (uint64_t)(x & (~0UL));

  /* x mod 10 = (a * 2^64 + b) mod 10; 2*((1UL<<63)%10) == 16, and
     16 mod 10 == 6 == 2^64 mod 10, so only cheap 64-bit mods remain */
  s += ((a%10)*2*((1UL<<63)%10) + (b%10))%10;
}

This benchmarks at:

Executed in    3,55 secs   fish           external 
usr time       3,55 secs  409,00 micros    3,55 secs 
sys time       0,00 secs  233,00 micros    0,00 secs

A nice 5x speedup for the modulo-10 operation. Note that 10 is not magic here, apart from the compiler possibly being extra smart about 10 for 64-bit unsigned integers. A similar trick can be done for integer division by 10. Note that we can always write a number x as x = a*10 + b, where a = x/10 and b = x%10; then, again, we can study x1*x2 and x1+x2 to deduce similar rules for the integer division of 128-bit integers, taking advantage of the fast 64-bit versions. If one does the work, we can produce the following code:

inline uint128_t div10q(uint128_t x)
{
  uint64_t   x1  = (uint64_t)(x >> 64);
  uint128_t  x2  = ((uint128_t)1) << 64;
  uint64_t   x3  = (uint64_t)(x & (~0UL));

  uint64_t   b1   = x1%10;
  uint128_t  y1  = x1/10;

  uint64_t   b2  = x2%10;
  uint128_t  y2  = x2/10;

  uint128_t yy1 = y1*y2*10+b1*y2+b2*y1 + (b1*b2)/10;
  uint64_t  bb1 = (b1*b2)%10;

  uint64_t  bb2 = x3 % 10;
  uint128_t yy2 = x3 / 10;

  return yy1+yy2+(bb1+bb2)/10;
}

That compiles, with -O3 in gcc, to a similar 5x speedup.

Stefan
  • 271
  • 1
  • 4
-3

This will work for (multiword) values larger than the machine word (but assuming a binary computer...):


#include <stdio.h>

unsigned long mod10(unsigned long val)
{
    unsigned res = 0;

    res = val & 0xf;
    while (res >= 10) { res -= 10; }

    for (val >>= 4; val; val >>= 4) {
        res += 6 * (val & 0xf);
        while (res >= 10) { res -= 10; }
    }

    return res;
}

int main(int argc, char **argv)
{
    unsigned long val;
    unsigned res;

    sscanf(argv[1], "%lu", &val);

    res = mod10(val);
    printf("%lu --> %u\n", val, res);

    return 0;
}

UPDATE: With some extra effort, you could get the algorithm free of multiplications, and with the proper amount of optimisation we can even get the recursive call inlined:


static unsigned long mod10_1(unsigned long val)
{
    unsigned char res = 0; // just to show that we don't need a big accumulator

    res = val & 0xf;       // res can never be > 15
    if (res >= 10) { res -= 10; }

    for (val >>= 4; val; val >>= 4) {
        res += ((val & 0xf) << 2) + ((val & 0xf) << 1); // 6 * nibble, without multiplying (+, not |: the shifted values overlap)
        res = mod10_1(res); // the recursive call
    }

    return res;
}

And the result for mod10_1 appears to be mul/div free and almost without branches:


mod10_1:
.LFB25:
    .cfi_startproc
    movl    %edi, %eax
    andl    $15, %eax
    leal    -10(%rax), %edx
    cmpb    $10, %al
    cmovnb  %edx, %eax
    movq    %rdi, %rdx
    shrq    $4, %rdx
    testq   %rdx, %rdx
    je      .L12
    pushq   %r12
    .cfi_def_cfa_offset 16
    .cfi_offset 12, -16
    pushq   %rbp
    .cfi_def_cfa_offset 24
    .cfi_offset 6, -24
    pushq   %rbx
    .cfi_def_cfa_offset 32
    .cfi_offset 3, -32
.L4:
    movl    %edx, %ecx
    andl    $15, %ecx
    leal    (%rcx,%rcx,2), %ecx
    leal    (%rax,%rcx,2), %eax
    movl    %eax, %ecx
    movzbl  %al, %esi
    andl    $15, %ecx
    leal    -10(%rcx), %r9d
    cmpb    $9, %cl
    cmovbe  %ecx, %r9d
    shrq    $4, %rsi
    leal    (%rsi,%rsi,2), %ecx
    leal    (%r9,%rcx,2), %ecx
    movl    %ecx, %edi
    movzbl  %cl, %ecx
    andl    $15, %edi
    testq   %rsi, %rsi
    setne   %r10b
    cmpb    $9, %dil
    leal    -10(%rdi), %eax
    seta    %sil
    testb   %r10b, %sil
    cmove   %edi, %eax
    shrq    $4, %rcx
    andl    $1, %r10d
    leal    (%rcx,%rcx,2), %r8d
    movl    %r10d, %r11d
    leal    (%rax,%r8,2), %r8d
    movl    %r8d, %edi
    andl    $15, %edi
    testq   %rcx, %rcx
    setne   %sil
    leal    -10(%rdi), %ecx
    andl    %esi, %r11d
    cmpb    $9, %dil
    seta    %bl
    testb   %r11b, %bl
    cmovne  %ecx, %edi
    andl    $1, %r11d
    andl    $240, %r8d
    leal    6(%rdi), %ebx
    setne   %cl
    movl    %r11d, %r8d
    andl    %ecx, %r8d
    leal    -4(%rdi), %ebp
    cmpb    $9, %bl
    seta    %r12b
    testb   %r8b, %r12b
    cmovne  %ebp, %ebx
    andl    $1, %r8d
    cmovne  %ebx, %edi
    xorl    $1, %ecx
    andl    %r11d, %ecx
    orb     %r8b, %cl
    cmovne  %edi, %eax
    xorl    $1, %esi
    andl    %r10d, %esi
    orb     %sil, %cl
    cmove   %r9d, %eax
    shrq    $4, %rdx
    testq   %rdx, %rdx
    jne     .L4
    popq    %rbx
    .cfi_restore 3
    .cfi_def_cfa_offset 24
    popq    %rbp
    .cfi_restore 6
    .cfi_def_cfa_offset 16
    movzbl  %al, %eax
    popq    %r12
    .cfi_restore 12
    .cfi_def_cfa_offset 8
    ret
.L12:
    movzbl  %al, %eax
    ret
    .cfi_endproc
.LFE25:
    .size   mod10_1, .-mod10_1
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • Could you format the algorithm proper and add some explanation? For example, why the initial `while` loop? It will run at most once. – Konrad Rudolph Apr 27 '18 at 17:04
  • Further optimisation is left as an exercise for the reader. (assuming homework,here) – wildplasser Apr 27 '18 at 17:06
  • 1
    … did you just excuse a logic mistake and lack of explanation as “not doing homework”? – Konrad Rudolph Apr 27 '18 at 17:08
  • @KonradRudolph Yes. And I **know** that 16 is not above 19. And that 99 is not above 99,.. – wildplasser Apr 27 '18 at 17:12
  • I would not have expected to see someone *approvingly* describe a modulo implementation as "almost without branches". Generally the expected number would be zero. – Sneftel Apr 28 '18 at 11:42
  • @Sneftel :could you demonstrate a method that does not need multiplication or division (and can be extended to *bignums*, and to 8-bit machines) and uses fewer branches? – wildplasser Apr 28 '18 at 11:54
  • 1
    The OP is clearly not asking about multiword arithmetic, but any of the methods in the current answers are trivial to extend in that way. And why are you so keen to avoid multiplication? I'd take ten multiplications over one mispredicted branch any day. – Sneftel Apr 28 '18 at 12:05