
Just found the following line in some old src code:

int e = (int)fmod(matrix[i], n);

where matrix is an array of int, and n is a size_t

I'm wondering why the use of fmod rather than % where we have integer arguments, i.e. why not:

int e = (matrix[i]) % n;

Could there possibly be a performance reason for choosing fmod over % or is it just a strange bit of code?

bph
    `fmod` is using floating point values which will be converted to `double` and back. So: **no**. For integer arithmetic, please use the `%` operator. – Weather Vane Jan 16 '17 at 21:17
  • Probably slower if anything? I'm not au fait with producing the corresponding assembly from C statements, but I'm imagining there would be more of it if you were to use `fmod` over `%` – bph Jan 16 '17 at 21:18

3 Answers


Could there possibly be a performance reason for choosing fmod over % or is it just a strange bit of code?

fmod might be a bit faster on architectures with a high-latency IDIV instruction that takes (say) ~50 cycles or more, so the cost of fmod's function call and the int <-> double conversions can be amortized.

According to Agner Fog's instruction tables, IDIV on the AMD K10 architecture takes 24-55 cycles. On a modern Intel Haswell, the latency range is listed as 22-29 cycles; however, if there are no dependency chains, the reciprocal throughput is much better on Intel, at 8-11 clock cycles.

Grzegorz Szpetkowski

fmod might be a tiny bit faster than the integer division on selected architectures.

Note however that if n has a known non-zero value at compile time, matrix[i] % n would be compiled as a multiplication with a small adjustment, which should be much faster than both the integer modulus and the floating-point modulus.

Another interesting difference is the behavior for n == 0 and for INT_MIN % -1. The integer modulus invokes undefined behavior in both cases (division by zero, and signed overflow since INT_MIN / -1 exceeds the range of int), which results in abnormal program termination on many current architectures. Conversely, the floating-point modulus has no such traps: under IEEE 754 semantics, fmod(x, 0.0) yields NaN and fmod(INT_MIN, -1.0) yields -0.0. Converting a NaN back to int is not well defined either, but on common platforms it produces an arbitrary value rather than terminating the program. This might be the reason for the original programmer to have chosen this surprising solution.

chqrlie
  • For my particular scenario n > 0 and typically < ~1000, and it's not a compile-time constant. I've swapped out the fmod for %. On my Intel i5 I saw a 40% speedup. Thanks for the additional insights – bph Jan 18 '17 at 10:49

Experimentally (and quite counter-intuitively), fmod is faster than % - at least on an AMD Phenom(tm) II X4 955 with 6400 bogomips. Here are two programs that use either of the techniques, both compiled with the same compiler (GCC) and the same options (cc -O3 foo.c -lm), and run on the same hardware:

#include <math.h>
#include <stdio.h>

int main()
{
    int volatile a=10,b=12;
    int i, sum = 0;
    for (i = 0; i < 1000000000; i++)
        sum += a % b;
    printf("%d\n", sum);
    return 0;
}

Running time: 9.07 sec.

#include <math.h>
#include <stdio.h>

int main()
{
    int volatile a=10,b=12;
    int i, sum = 0;
    for (i = 0; i < 1000000000; i++)
        sum += (int)fmod(a, b);
    printf("%d\n", sum);
    return 0;
}

Running time: 8.04 sec.

DYZ
  • now this is very interesting - nice one, I should have done that myself.. I'm glad I asked the question and got such a smart response.. – bph Jan 16 '17 at 21:54
    On my system, version 1 runs in 3.07 seconds, version 2 runs in 8.97 seconds. So I get the opposite result, with a larger margin. It's going to depend a lot on the exact hardware you are using, and various other things as well. – Dietrich Epp Jan 16 '17 at 21:58
    @DietrichEpp Out of curiosity, what CPU did you use? I guess my CPU is sluggish, but the FPU seems ok :) – DYZ Jan 16 '17 at 22:03
    It's not *completely* unexpected, your loop code uses integer arithmetic, which means with `%` the floating-point execution units are sitting there doing nothing and the loop increment is waiting for the integer divide. With `fmod` the loop increment and modulo use different resources and take place concurrently. But this is all extremely specific to the code surrounding modulo and the CPU subarchitecture, and should not be used to make general statements such as the one the question is asking for. Another CPU has enough integer execution units to perform both integer division and loop update – Ben Voigt Jan 16 '17 at 22:08
    It's an i5-4258U. Although `idiv` is notoriously slow, the relative amount of effort put in to optimizing integer and floating-point parts of a processor varies wildly between different processors. – Dietrich Epp Jan 16 '17 at 22:08
    And another processor (Dietrich's case) has `fmod` so much slower than `%` that the mix of execution units makes no difference. – Ben Voigt Jan 16 '17 at 22:10
  • how are you timing it? – bph Jan 16 '17 at 22:13
  • Comparing two things with different behavior makes no sense. – Stargateur Jan 16 '17 at 22:17
    I get 2.949000 (`%`) and 29.470000 (`fmod`) seconds from MSVC 15.0 compilation. – Weather Vane Jan 16 '17 at 22:17
    @bph `time foo` – DYZ Jan 16 '17 at 22:21
  • @Stargateur Both snippets are doing exactly the same thing. Where is the different behavior? – DYZ Jan 16 '17 at 22:23
  • Intel® Core™ i5-5200U CPU @ 2.20GHz × 4 - integer_modulus, 3.342s - floatingpoint_modulus, 5.727s – bph Jan 16 '17 at 22:32
  • @DYZ Well in your case it's produce the same result but use a function that use double can lead to different behavior if the integer can't be represent by the floating number. For example "9,007,199,254,740,993" is the first integer that double can't represent. It has no sense to use this function because its design for floating number not integer. – Stargateur Jan 16 '17 at 22:33
  • @Stargateur - there's sense in it if it produces the correct result over the numbers contained in the input array and it's quicker? But alas, for me it's slower.. – bph Jan 16 '17 at 22:37
  • does the use of volatile prevent the compiler optimising away the calculation within the for loop? – bph Jan 16 '17 at 22:50
    @bph Yes it does (at least my compiler). – DYZ Jan 16 '17 at 22:51
  • I'm using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609 - it must be respecting the volatile qualifier or I wouldn't have seen a difference in runtime? – bph Jan 16 '17 at 22:54