Rounding integers routine

Question

There is something that baffles me with integer arithmetic in tutorials. To be precise, integer division.

The seemingly preferred method is by casting the divisor into a float, then rounding the float to the nearest whole number, then cast that back into integer:

#include <cmath>

int round_divide_by_float_casting(int a, int b){
    return  (int) std::roundf( a / (float) b);
}

Yet this seems like scratching your left ear with your right hand. I use:

int round_divide (int a, int b){
    return a / b + a % b * 2 / b;
}

It's no breakthrough, but the fact that it is not standard makes me wonder if I am missing anything?

Despite my (albeit limited) testing, I couldn't find any scenario where the two methods give me different results. Did someone run into some sort of scenario where the int → float → int casting produced more accurate results?

For me the first one is much clearer. I know what it is trying to do. I know how it is trying to do it. Without running though it on paper I have no idea what the second one is doing. I'm also tipping the first is faster because the second does several arithmetic operations — John3136, Mar 01 '16 at 08:00
I agree clarity is important. But as for operation complexity (and speed), i'm not sure. — Adl A, Mar 01 '16 at 08:16

YSC · Answer 1 · 2019-12-16T19:42:10.783

Arithmetic solution

If one defined what your functions should return, she would describe it as something close as "f(a, b) returns the closest integer of the result of the division of a by b in the real divisor ring."

Thus, the question can be summarized as: can we define this closest integer using only integer division. I think we can.

There is exactly two candidates as the closest integer: a / b and (a / b) + 1(1). The selection is easy, if a % b is closer to 0 as it is to b, then a / b is our result. If not, (a / b) + 1 is.

One could then write something similar to, ignoring optimization and good practices:

int divide(int a, int b)
{
    const int quot = a / b;
    const int rem = a % b;
    int result;

    if (rem < b - rem) {
        result = quot;
    } else {
        result = quot + 1;
    }
    return result;
}

While this definition satisfies out needs, one could optimize it by not computing two times the division of a by b with the use of std::div():

int divide(int a, int b)
{
    const std::div_t dv = std::div(a, b);
    int result = dv.quot;

    if (dv.rem >= b - dv.rem) {
        ++result;
    }
    return result;
}

The analysis of the problem we did earlier assures us of the well defined behaviour of our implementation.

(1)There is just one last thing to check: how does it behaves when a or b is negative? This is left to the reader ;).

Benchmark

#include <iostream>
#include <iomanip>
#include <string>

// solutions
#include <cmath>
#include <cstdlib>

// benchmak
#include <limits>
#include <random>
#include <chrono>
#include <algorithm>
#include <functional>

//
// Solutions
//
namespace
{
    int round_divide_by_float_casting(int a, int b) {
        return  (int)roundf(a / (float)b);
    }

    int round_divide_by_modulo(int a, int b) {
        return a / b + a % b * 2 / b;
    }

    int divide_by_quotient_comparison(int a, int b)
    {
        const std::div_t dv = std::div(a, b);
        int result = dv.quot;

        if (dv.rem >= b - dv.rem)
        {
            ++result;
        }
        return result;
    }
}

//
// benchmark
//
class Randomizer
{
    std::mt19937 _rng_engine;
    std::uniform_int_distribution<int> _distri;

public:
    Randomizer() : _rng_engine(std::time(0)), _distri(std::numeric_limits<int>::min(), std::numeric_limits<int>::max())
    {
    }

    template<class ForwardIt>
    void operator()(ForwardIt begin, ForwardIt end)
    {
        std::generate(begin, end, std::bind(_distri, _rng_engine));
    }
};

class Clock
{
    std::chrono::time_point<std::chrono::steady_clock> _start;

public:
    static inline std::chrono::time_point<std::chrono::steady_clock> now() { return std::chrono::steady_clock::now(); }

    Clock() : _start(now())
    {
    }

    template<class DurationUnit>
    std::size_t end()
    {
        return std::chrono::duration_cast<DurationUnit>(now() - _start).count();
    }
};

//
// Entry point
//
int main()
{
    Randomizer randomizer;
    std::array<int, 1000> dividends; // SCALE THIS UP (1'000'000 would be great)
    std::array<int, dividends.size()> divisors;
    std::array<int, dividends.size()> results;
    randomizer(std::begin(dividends), std::end(dividends));
    randomizer(std::begin(divisors), std::end(divisors));

    {
        Clock clock;
        auto dividend = std::begin(dividends);
        auto divisor = std::begin(divisors);
        auto result = std::begin(results);
        for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
        {
            *result = round_divide_by_float_casting(*dividend, *divisor);
        }
        const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
        std::cout << std::setw(40) << "round_divide_by_float_casting(): " << std::setprecision(3) << unit_time << " ns\n";
    }
    {
        Clock clock;
        auto dividend = std::begin(dividends);
        auto divisor = std::begin(divisors);
        auto result = std::begin(results);
        for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
        {
            *result = round_divide_by_modulo(*dividend, *divisor);
        }
        const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
        std::cout << std::setw(40) << "round_divide_by_modulo(): " << std::setprecision(3) << unit_time << " ns\n";
    }
    {
        Clock clock;
        auto dividend = std::begin(dividends);
        auto divisor = std::begin(divisors);
        auto result = std::begin(results);
        for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
        {
            *result = divide_by_quotient_comparison(*dividend, *divisor);
        }
        const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
        std::cout << std::setw(40) << "divide_by_quotient_comparison(): " << std::setprecision(3) << unit_time << " ns\n";
    }
}

Outputs:

g++ -std=c++11 -O2 -Wall -Wextra -Werror main.cpp && ./a.out
       round_divide_by_float_casting(): 54.7 ns
              round_divide_by_modulo(): 24 ns
       divide_by_quotient_comparison(): 25.5 ns

Demo

The two arithmetic solutions' performances are not distinguishable (their benchmark converges when you scale the bench size up).

Hi, thanks for the suggestion. this is faster than **int->float->int** casting, but slower than the modulo arithmetic i posted. — Adl A, Mar 01 '16 at 08:58
@AdlA A good compiler would probably generate similar assembly when optimization is enabled. The difference lies in the readability and the ease one could find in _proving_ it behaves properly. — YSC, Mar 01 '16 at 09:01
I see what you mean. in essence `a % b * 2 / b` returns **0** or **1** , and it take little effort to see it is equivalent to `if (rem > b - rem) {return quot;} else {return quot + 1;}`. The difference it performance must lie elsewhere — Adl A, Mar 01 '16 at 09:14
The first code snippet was just given as a description. You should compare your arithmetic solution with the second definition of `divide(int,int)`. — YSC, Mar 01 '16 at 09:48
this is a technical question. I added a control benchmark to see how much the of the time was taken up by accessing and assigning memory, like so: `*results = 1;` what i have noticed is that the benchmark of the above is about 80% the values i get when i calculate the rounded integer... that would mean the actual difference needs to be multiplied fivefold. In any case, is this a desireable benchmark situation? — Adl A, Mar 01 '16 at 12:55
I don't understand your logic for choosing the closest integer. Why are you comparing the remainder to `b`? It produces fairly unexpected results. For example, if the two values divide evenly (take something as simple as `a = 4, b = 2`), then the result is rounded up to the next whole integer (*i.e.*, you get 3 instead of 2). A result like 5.1 would also be rounded up to 6. Arguably it is greater than 5, but it is certainly not closer to 6. Can you explain why you've chosen to handle the selection/rounding like this? — Cody Gray - on strike, Mar 07 '16 at 04:36
@CodyGray a type-o sneaked in the demonstration code, which I fixed. The code in the answer was good though. `divide(4, 2)` produces the expected result (`2`). — YSC, Mar 07 '16 at 09:17

Mats Petersson · Answer 2 · 2016-03-01T22:35:33.823

It would really depend on the processor, and the range of the integer which is better (and using double would resolve most of the range issues)

For modern "big" CPUs like x86-64 and ARM, integer division and floating point division are roughly the same effort, and converting an integer to a float or vice versa is not a "hard" task (and does the correct rounding directly in that conversion, at least), so most likely the resulting operations are.

atmp = (float) a;
btmp = (float) b;
resfloat = divide atmp/btmp;
return = to_int_with_rounding(resfloat)

About four machine instructions.

On the other hand, your code uses two divides, one modulo and a multiply, which is quite likely longer on such a processor.

tmp = a/b;
tmp1 = a % b;
tmp2 = tmp1 * 2;
tmp3 = tmp2 / b;
tmp4 = tmp + tmp3;

So five instructions, and three of those are "divide" (unless the compiler is clever enough to reuse a / b for a % b - but it's still two distinct divides).

Of course, if you are outside the range of number of digits that a float or double can hold without losing digits (23 bits for float, 53 bits for double), then your method MAY be better (assuming there is no overflow in the integer math).

On top of all that, since the first form is used by "everyone", it's the one that the compiler recognises and can optimise.

Obviously, the results depend on both the compiler being used and the processor it runs on, but these are my results from running the code posted above, compiled through clang++ (v3.9-pre-release, pretty close to released 3.8).

   round_divide_by_float_casting(): 32.5 ns
          round_divide_by_modulo(): 113 ns
   divide_by_quotient_comparison(): 80.4 ns

However, the interesting thing I find when I look at the generated code:

xorps   %xmm0, %xmm0
cvtsi2ssl   8016(%rsp,%rbp), %xmm0
xorps   %xmm1, %xmm1
cvtsi2ssl   4016(%rsp,%rbp), %xmm1
divss   %xmm1, %xmm0
callq   roundf
cvttss2si   %xmm0, %eax
movl    %eax, 16(%rsp,%rbp)
addq    $4, %rbp
cmpq    $4000, %rbp             # imm = 0xFA0
jne .LBB0_7

is that the round is actually a call. Which really surprises me, but explains why on some machines (particularly more recent x86 processors), it is faster.

g++ gives better results with -ffast-math, which gives around:

  round_divide_by_float_casting(): 17.6 ns
          round_divide_by_modulo(): 43.1 ns
   divide_by_quotient_comparison(): 18.5 ns

(This is with increased count to 100k values)

Thank you for the explanation. I went ahead and did some testing. on my machine (i7) with vs2015 the modulo arithmetic was about twice as fast. There must be some operations that are hidden in the **roundf()** — Adl A, Mar 01 '16 at 09:00
Mats thanks for updating your answer. It is very interesting how in your case the division by modulo arithmetic was in fact the slowest. In all my benchmarks - and seemingly also those made by @YSC - it was the division by casting rounded floats which was the slowest (and for you the fastest). I'm new to c++ and compiler optimizations , etc. , but i find it fascinating how much fluctuation there is in performance. It would be great to someday understand why... Cheers! — Adl A, Mar 02 '16 at 06:21

score 1 · Answer 3 · edited Dec 07 '19 at 10:54

1

Prefer the standard solution. Use std::div family of functions declared in cstdlib.

See: http://en.cppreference.com/w/cpp/numeric/math/div

Casting to float and then to int may be very inefficient on some architectures, for example, microcontrollers.

edited Dec 07 '19 at 10:54

Peter Mortensen

30,738
21
105
131

answered Mar 01 '16 at 08:02

teroi

1,087
10
19

Thanks for the suggestion. I knew there were plenty of already made routines to do such a basic task. but my question is whether doing it by casting (the popular way) is more reliable and/or faster than by modulo arithmetic (the unpopular way) – Adl A Mar 01 '16 at 08:21
1

just to add that after some testing, this is faster than **int->float->int** casting, but slower than the modulo arithmetic i posted. – Adl A Mar 01 '16 at 09:01
The benchmarks will probably be very much HW architecture specific. Your modulo solution is probably faster than the casting way in most architectures. – teroi Mar 02 '16 at 11:14

score 0 · Answer 4 · edited Dec 07 '19 at 11:04

Thanks for the suggestions so far. To shed some light I made a test setup to compare performance.

#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>

using namespace std;

int round_divide_by_float_casting(int a, int b) {
    return  (int)roundf(a / (float)b);
}

int round_divide_by_modulo(int a, int b) {
    return a / b + a % b * 2 / b;
}

int divide_by_quotient_comparison(int a, int b)
{
    const std::div_t dv = std::div(a, b);
    int result = dv.quot;

    if (dv.rem <= b - dv.rem) {
        ++result;
    }
    return result;
}

int main()
{
    int itr = 1000;

    //while (true) {
        auto begin = chrono::steady_clock::now();
        for (int i = 0; i < itr; i++) {
            for (int j = 10; j < itr + 1; j++) {
                divide_by_quotient_comparison(i, j);
            }
        }
        auto end = std::chrono::steady_clock::now();
        cout << "divide_by_quotient_comparison(,) function took: "
             << chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
             << endl;

        begin = chrono::steady_clock::now();
        for (int i = 0; i < itr; i++) {
            for (int j = 10; j < itr + 1; j++) {
                round_divide_by_float_casting(i, j);
            }
        }
        end = std::chrono::steady_clock::now();
        cout << "round_divide_by_float_casting(,) function took: "
             << chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
             << endl;

        begin = chrono::steady_clock::now();
        for (int i = 0; i < itr; i++) {
            for (int j = 10; j < itr + 1; j++) {
                round_divide_by_modulo(i, j);
            }
        }
        end = std::chrono::steady_clock::now();
        cout << "round_divide_by_modulo(,) function took: "
             << chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
             << endl;
    //}

    return 0;
}

The results I got on my machine (i7 with Visual Studio 2015) was as follows: the modulo arithmetic was about twice as fast as the int → float → int casting method. The method relying on std::div_t (suggested by @YSC and @teroi) was faster than the int → float → int, but slower than the modulo arithmetic method.

A second test was performed to avoid certain compiler optimizations pointed out by @YSC:

#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
#include <vector>
using namespace std;

int round_divide_by_float_casting(int a, int b) {
    return  (int)roundf(a / (float)b);
}

int round_divide_by_modulo(int a, int b) {
    return a / b + a % b * 2 / b;
}

int divide_by_quotient_comparison(int a, int b)
{
    const std::div_t dv = std::div(a, b);
    int result = dv.quot;

    if (dv.rem <= b - dv.rem) {
        ++result;
    }
    return result;
}

int main()
{
    int itr = 100;
    vector <int> randi, randj;
    for (int i = 0; i < itr; i++) {
        randi.push_back(rand());
        int rj = rand();
        if (rj == 0)
            rj++;
        randj.push_back(rj);
    }
    vector<int> f, m, q;

    while (true) {
        auto begin = chrono::steady_clock::now();
        for (int i = 0; i < itr; i++) {
            for (int j = 0; j < itr; j++) {
                q.push_back( divide_by_quotient_comparison(randi[i] , randj[j]) );
            }
        }
        auto end = std::chrono::steady_clock::now();
        cout << "divide_by_quotient_comparison(,) function took: "
             << chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
             << endl;

        begin = chrono::steady_clock::now();
        for (int i = 0; i < itr; i++) {
            for (int j = 0; j < itr; j++) {
                f.push_back( round_divide_by_float_casting(randi[i], randj[j]) );
            }
        }
        end = std::chrono::steady_clock::now();
        cout << "round_divide_by_float_casting(,) function took: "
             << chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
             << endl;

        begin = chrono::steady_clock::now();
        for (int i = 0; i < itr; i++) {
            for (int j = 0; j < itr; j++) {
                m.push_back( round_divide_by_modulo(randi[i], randj[j]) );
            }
        }
        end = std::chrono::steady_clock::now();
        cout << "round_divide_by_modulo(,) function took: "
             << chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
             << endl;
        cout << endl;

        f.clear();
        m.clear();
        q.clear();
    }

    return 0;
}

In this second test the slowest was the divide_by_quotient() reliant on std::div_t, followed by divide_by_float(), and the fastest again was the divide_by_modulo(). However this time the performance difference was much, much lower, less than 20%.

Your benchmark is strongly flawned: the compiler is way _too smart_: it rearranges the loop orders and optimises out side-effectless same-computation. You should try with random data. — YSC, Mar 01 '16 at 09:52
Thanks for the suggestion. did you manage to try with random data? I will try it shortly — Adl A, Mar 01 '16 at 10:10
In this second test the slowest was the `divide_by_quotient()` reliant on **std::div_t**, followed by `divide_by_float()` reliant of **roundf()**, and the fastest again was the `divide_by_modulo()`. however this time the performance difference between the slowest and the fastest was much, much lower, less than 20%. — Adl A, Mar 01 '16 at 10:37

Rounding integers routine

4 Answers4

Arithmetic solution

Benchmark

Linked