
Why test the speed of modulus?


I have an app where the modulus operation is performed millions of times a second. I have to work with very big numbers, so I chose unsigned long long as the data type. About a week ago I wrote a new algorithm for this app that performs the modulus operation on numbers much smaller than the ones I used to work with (e.g. 26 instead of 10000000), so I chose unsigned int as the data type instead. The speed increased dramatically, even though the algorithm is almost the same.

Testing...


I've written two simple programs in C to test the speed of modulus calculation.

#include <stdio.h>

typedef unsigned long long ull;

int main(){
   puts("Testing modulus with ull...");
   ull cnt;
   ull k, accum=0;
   for(k=1, cnt=98765432; k<=10000000; ++k, --cnt)
      accum += cnt % 80;   /* the operation being timed */
   printf("%llu\n",accum);
   return 0;
}

The only thing I changed between the two versions was the type of the variable cnt, as shown below.
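For reference, here is the second variant written out in full. Per the sentence above, only the declaration of cnt differs (the puts message is adjusted here purely for readability):

#include <stdio.h>

typedef unsigned long long ull;

int main(){
   puts("Testing modulus with unsigned int...");
   unsigned int cnt;               /* the only changed declaration */
   ull k, accum=0;
   for(k=1, cnt=98765432; k<=10000000; ++k, --cnt)
      accum += cnt % 80;           /* the operation being timed */
   printf("%llu\n", accum);
   return 0;
}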

I tested these programs with `time ./progname` and the results were as follows.

  • With unsigned long long: 3.28 sec
  • With unsigned int: 0.33 sec

Note: I'm testing it on a jailbroken iPad, which is why it takes so long.
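`time ./progname` measures the whole process, startup and output included. An alternative is to time just the loop with the standard `clock()` function; a minimal sketch (not part of the original test):

#include <stdio.h>
#include <time.h>

typedef unsigned long long ull;

int main(void){
   ull k, cnt, accum = 0;
   clock_t t0 = clock();                    /* CPU time before the loop */
   for(k=1, cnt=98765432; k<=10000000; ++k, --cnt)
      accum += cnt % 80;
   clock_t t1 = clock();                    /* CPU time after the loop */
   printf("accum=%llu, %.3f s\n", accum, (double)(t1 - t0)/CLOCKS_PER_SEC);
   return 0;
}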

Why?


Why does the version with unsigned long long take so much time to run?

Update 1: added `--cnt` to the loop so that `cnt%80` isn't constant; the results are still the same.

Update 2: removed `printf` from the loop and accumulated into `accum` instead, to exclude the time taken by `printf`; the timings are much smaller now, but still quite different.

ForceBru
  • Hint: `printf("%d vs %d", sizeof(unsigned int), sizeof(unsigned long long));` – Morten Jensen Jun 28 '15 at 15:31
  • The problem with printing in timing tests is that you also test the timing of the printing. And console output is considerably slower than a plain modulus. – Some programmer dude Jun 28 '15 at 15:34
  • @YuHao, to tell the truth, I expected the speed to be equal, but it turned out that it isn't... – ForceBru Jun 28 '15 at 15:34
  • This test program is invalid. The majority of its time is going to be spent in `printf`. And `cnt%80` has type `unsigned long long`, not `unsigned int`, so you can't print it with the `%u` specifier. – R.. GitHub STOP HELPING ICE Jun 28 '15 at 15:35
  • @JoachimPileborg, I've triple-checked the results; they're always almost the same. The `printf` call is in both programs, so it shouldn't affect the results much. – ForceBru Jun 28 '15 at 15:36
  • @R.., I agree about the type of the result. At first I was testing it with another variable of type `unsigned int` and forgot to change the format specifier here. – ForceBru Jun 28 '15 at 15:38
  • How do you compile your program? Do you enable optimizations? – dlask Jun 28 '15 at 15:44
  • @dlask, I'm doing my best to keep the compiler from optimizing anything (with `gcc -O0 -o prog prog.c`) – ForceBru Jun 28 '15 at 15:46
  • Without optimization the result is meaningless. As for the reason: can you do an 8-digit division by hand as fast as a 4-digit one? – phuclv Jun 28 '15 at 16:08
  • @ForceBru: That's your problem. If you're using `-O0` your results are nonsense. Switch to `-O2`, `-O3`, or `-Os`. – R.. GitHub STOP HELPING ICE Jun 28 '15 at 16:35
  • @R.., just did it. The timings became even worse: 16.7 s with `ull` and 12.7 s with `unsigned int`. – ForceBru Jun 28 '15 at 16:39
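Morten Jensen's hint above, written out as a complete program (note that `sizeof` yields a `size_t`, so `%zu` is the matching specifier rather than `%d`):

#include <stdio.h>

int main(void){
   /* on a typical 32-bit ARM target this prints "4 vs 8" */
   printf("%zu vs %zu\n", sizeof(unsigned int), sizeof(unsigned long long));
   return 0;
}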

2 Answers


Fundamentally, the time it takes to perform an arithmetic operation scales at least linearly with the number of bits in the operands. For modern CPUs the time is constant (usually one cycle) for addition, subtraction, logical operations, and perhaps multiplication when the operands fit in registers, but scale up to RSA orders of magnitude or other "bignum" usage and you will see clearly how the cost of arithmetic grows.

Division and remainder operations are inherently more costly, and you will often notice a significant difference between operand sizes. And of course, if your CPU is 32-bit, a 64-bit division/remainder operation has to be constructed out of multiple smaller operations (much like a small special case of "bignum" arithmetic), so it is going to be considerably slower.
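As an illustration of that "constructed out of smaller operations" point, here is a sketch of the bit-at-a-time shift-and-subtract scheme such helper routines boil down to (not the actual runtime code, just the idea):

#include <stdint.h>

/* "Soft" 64-bit remainder: bring down one bit of n at a time and
   subtract d whenever it fits, exactly like long division by hand.
   Assumes d != 0; for d < 2^63 the shift below cannot overflow. */
uint64_t soft_umod64(uint64_t n, uint64_t d)
{
    uint64_t rem = 0;
    for (int i = 63; i >= 0; --i) {
        rem = (rem << 1) | ((n >> i) & 1);  /* next bit of the dividend */
        if (rem >= d)
            rem -= d;                       /* subtract when it fits */
    }
    return rem;                             /* rem == n % d */
}

Run bit-at-a-time like this, a 64-bit remainder needs twice as many iterations as a 32-bit one, in line with the comments on the second answer below.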

Your test, however, is completely invalid. The division is constant, so it should not even be recomputed on each loop iteration; the time spent in the loop should be dominated by printf; and the format specifier you're using with printf is not valid for arguments of type unsigned long long, so your program has undefined behavior.

R.. GitHub STOP HELPING ICE
  • I've updated the code so `cnt%80` won't be constant. I've also fixed the wrong format specifier in `printf`. – ForceBru Jun 28 '15 at 15:47
  • Having the `printf` in the loop is still dominating the time. Instead do something like `accum += cnt%80` on each iteration, then print `accum` after the loop. Also keep in mind that division/remainder with a constant divisor is going to have significantly different performance properties from division/remainder with a divisor that can vary at runtime. – R.. GitHub STOP HELPING ICE Jun 28 '15 at 16:37
  • just removed `printf` and added `accum`. Well, I didn't think `printf` could take _that_ much time... – ForceBru Jun 28 '15 at 16:48
  • You still have undefined behavior if you print an `unsigned int` with `%llu` and only change the typedef. Anyway, the reason `long long` is slow has already been answered. If you do it with a bignum type, the difference is even bigger – phuclv Jun 29 '15 at 04:56
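One way to get the varying-divisor behavior R.. describes is to keep the divisor in a volatile object, so the compiler must reload it every iteration and cannot strength-reduce the division (a sketch, not from the original posts):

#include <stdio.h>

typedef unsigned long long ull;

volatile ull divisor = 80;      /* value unknowable at compile time */

int main(void){
   ull k, cnt, accum = 0;
   for(k=1, cnt=98765432; k<=10000000; ++k, --cnt)
      accum += cnt % divisor;   /* a genuine run-time division */
   printf("%llu\n", accum);
   return 0;
}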

Assuming a 32-bit system, the difference is between a 64-bit and a 32-bit modulo operation.

ull cnt;

results in (using -O2 optimization):

.L2:
    pushl   $0
    pushl   $80
    pushl   %edi
    pushl   %esi
    call    __umoddi3           ; note the function call here
    addl    $16, %esp
    addl    %eax, -32(%ebp)
    adcl    %edx, -28(%ebp)
    addl    $-1, %esi
    movl    %esi, %eax
    adcl    $-1, %edi
    xorl    $88765432, %eax
    orl     %edi, %eax
    jne     .L2
    pushl   -28(%ebp)
    pushl   -32(%ebp)
    pushl   $.LC1
    pushl   $1
    call    __printf_chk

while

unsigned int cnt;

results in (using -O2 optimization too):

.L2:
    movl    %ecx, %eax
    mull    %ebx
    shrl    $6, %edx
    leal    (%edx,%edx,4), %eax
    movl    %ecx, %edx
    sall    $4, %eax
    subl    %eax, %edx
    movl    %edx, %eax
    xorl    %edx, %edx
    addl    %eax, %esi
    adcl    %edx, %edi
    subl    $1, %ecx
    cmpl    $88765432, %ecx
    jne     .L2
    pushl   %edi
    pushl   %esi
    pushl   $.LC1
    pushl   $1
    call    __printf_chk

Considering also the amount of code in the `__umoddi3` function, we have the question answered.
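Notice what the second listing does instead of dividing: since 80 is a compile-time constant, the compiler replaced the division with a multiply by a precomputed reciprocal. A sketch of that transformation (the constant itself sits in %ebx and isn't shown in the listing; ceil(2^38 / 80) = 3435973837 is the standard choice, matching the mull / shrl $6 / leal / sall / subl sequence above):

#include <stdio.h>
#include <stdint.h>

/* x % 80 without a divide instruction: take the high bits of
   x * M to get the quotient, then recover the remainder. */
static uint32_t mod80(uint32_t x)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 3435973837u) >> 38);  /* x / 80 */
    return x - q * 80u;                                          /* x % 80 */
}

int main(void)
{
    /* spot-check against the % operator */
    for (uint32_t x = 98765432; x > 98765332; --x)
        if (mod80(x) != x % 80u)
            puts("mismatch");
    puts("done");
    return 0;
}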

dlask
  • The OP is using an iPad, so it's ARM, not x86, but the point still applies: dividing a 64-bit value is almost always slower than dividing a 32-bit one, except on very strange architectures. The resulting assembly for ARM/ARM64 and x86/x86_64 can be seen [here](https://goo.gl/ZhaBBw) – phuclv Jun 29 '15 at 04:51
  • In the case of ARM, there's generally no hardware divide instruction and division is performed by one long function that does it a bit at a time. Remainder is performed by dividing (with that function) then multiplying and subtracting. – R.. GitHub STOP HELPING ICE Jun 29 '15 at 05:19
  • And obviously a soft-div done one bit at a time is going to take considerably longer for 64 bits than for 32... – R.. GitHub STOP HELPING ICE Jun 29 '15 at 05:19
  • Thank you all for the clarification. My main intention was to demonstrate that arithmetic operations can take a lot of time when they lack the appropriate hardware support. Anyway, I should have shown the ARM assembly code instead of the Intel one. Shall I change my answer? – dlask Jun 29 '15 at 05:39