Templatized branchless int max/min function

Question

I'm trying to write a branchless function to return the MAX or MIN of two integers without resorting to if (or ?:). Using the usual technique I can do this easily enough for a given word size:

inline int32 imax( int32 a, int32 b )
{
    // signed for arithmetic shift
    int32 mask = a - b;
    // mask < 0 means MSB is 1.
    return a + ( ( b - a ) & ( mask >> 31 ) );
}

Now, assuming arguendo that I really am writing the kind of application on the kind of in-order processor where this is necessary, my question is whether there is a way to use C++ templates to generalize this to all sizes of int.

The >>31 step only works for int32s, of course, and while I could copy out overloads on the function for int8, int16, and int64, it seems like I should use a template function instead. But how do I get the size of a template argument in bits?

Is there a better way to do it than this? Can I force the mask T to be signed? If T is unsigned the mask-shift step won't work (because it'll be a logical rather than arithmetic shift).

template< typename T > 
inline T imax( T a, T b )
{
    // how can I force this T to be signed?
    T mask = a - b;
    // I hope the compiler turns the math below into an immediate constant!
    mask = mask >> ( (sizeof(T) * 8) - 1 );
    return a + ( ( b - a ) & mask );
}

And, having done the above, can I prevent it from being used for anything but an integer type (eg, no floats or classes)?

Most modern machines have conditional move instructions that let them do min/max with no branches (e.g., cmp a,b / movlt a,b). This would be faster than the code you plan to generate, and compilers know about them. Are you sure your compiler doesn't already do this for you? – Ira Baxter, Nov 29 '12 at 03:12
@IraBaxter Absolutely sure; I always look at its assembly output. Also, the processor I target (a PowerPC derivative) definitely hasn't got a cmov. – Crashworks, Nov 29 '12 at 03:16
Whatever code you write, it will be branchless only as c++ source. Compiler may generate conditional jumps (ie branches) without writing if/else/?/: , and conversely may generate optimized branchless instructions from if/else source. — galinette, Dec 08 '15 at 15:39
Unless I'm mistaken there is a bug when a=0,b=INT_MIN, yes? as (a-b) == INT_MIN, so mask is -1, so (b-a) & mask == INT_MIN, so result == 0 + INT_MIN. If you're going ahead with it anyway, the "in theory" optimized code is probably ```mov eax,; sub eax,; cdq; and eax,edx; add eax,;```, using the sign-extend register pair instruction to create the mask in edx. In case that's interesting. — l.k, Mar 11 '21 at 05:12
“[W]ith modern CPUs it is more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take. So in a way, as CPUs get smarter, you want to make your code ‘dumber’ (i.e. more predictable) in order to get the best performance. When hardware was ‘dumber’, it was better to make your code smarter.” — Jonathan Marler in the D language forum. If you write the well-known version with `?:`, any optimizing compiler knows what you’re up to and how to give you the best version of it. — Quirin F. Schroll, Feb 10 '23 at 09:58
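
For the overflow case raised in the comments (a=0, b=INT_MIN), one possible workaround for the 32-bit version is to compute the difference in a wider type. This is only a sketch: imax32_wide is a made-up name, it assumes an arithmetic right shift on signed values, and 64-bit arithmetic may well cost more than it saves on a 32-bit target.

```cpp
#include <cstdint>

// Hypothetical helper, not from the thread: 32-bit max using a 64-bit intermediate.
inline std::int32_t imax32_wide(std::int32_t a, std::int32_t b)
{
    // The difference of two 32-bit values always fits in 64 bits, so no overflow here.
    const std::int64_t diff = static_cast<std::int64_t>(a) - b;
    // All ones when a < b, zero otherwise (assumes arithmetic right shift).
    const std::int64_t mask = diff >> 63;
    // a - (a - b) == b when the mask is set; a is returned unchanged otherwise.
    return static_cast<std::int32_t>(a - (diff & mask));
}
```

With this version, imax32_wide(0, INT32_MIN) yields 0 rather than wrapping around.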

Evan Teran · Accepted Answer · 2020-03-02T03:07:34.903

EDIT: This answer predates C++11. Since then, the standard library has offered std::make_signed<T> (in <type_traits>) and much more.

Generally, looks good, but for 100% portability, replace that 8 with CHAR_BIT (from <climits>), or equivalently std::numeric_limits<unsigned char>::digits, since a char isn't guaranteed to be 8 bits.

Any good compiler will be smart enough to merge all of the math constants at compile time.
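
If you would rather not rely on the optimizer for that, the shift distance can be made an explicit compile-time constant. A minimal sketch, assuming C++11 constexpr; bit_count and sign_shift are names invented here for illustration:

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// Hypothetical helper: width of T in bits, including any sign bit.
template <typename T>
constexpr int bit_count()
{
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Hypothetical helper: the shift distance that isolates the top bit of T.
template <typename T>
constexpr int sign_shift()
{
    return bit_count<T>() - 1;
}

// CHAR_BIT and numeric_limits agree on the width of unsigned char.
static_assert(bit_count<unsigned char>() == std::numeric_limits<unsigned char>::digits,
              "CHAR_BIT should match the width of unsigned char");
```

Because both functions are constexpr, sign_shift<T>() can be used anywhere a constant expression is needed.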

You can force it to be signed by using a type traits library, which would usually look something like this (assuming your library is called numeric_traits):

typename numeric_traits<T>::signed_type x;

An example of a manually rolled numeric_traits header could look like this: http://rafb.net/p/Re7kq478.html (there is plenty of room for additions, but you get the idea).

or better yet, use boost:

typename boost::make_signed<T>::type x;
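
Putting the pieces together with the C++11 standard library instead of Boost, the template might look roughly like this. Treat it as a sketch under assumptions: imax_any is a made-up name, the shift is assumed to be arithmetic, and the result is only correct when a - b is representable in the signed type (so it still misbehaves for very distant int or unsigned values).

```cpp
#include <climits>
#include <type_traits>

// Hypothetical name; a sketch of the question's template, not a drop-in replacement.
template <typename T>
inline T imax_any(T a, T b)
{
    static_assert(std::is_integral<T>::value, "imax_any: integer types only");

    // Type of a - b after integral promotion (int for char and short).
    using P = decltype(a - b);
    // Signed counterpart of that type, so the shift below acts on a signed value.
    using S = typename std::make_signed<P>::type;

    const int shift = static_cast<int>(sizeof(S) * CHAR_BIT) - 1;
    // All ones when a < b, zero otherwise. Assumes an arithmetic right shift
    // and that a - b fits in S.
    const S mask = static_cast<S>(a - b) >> shift;
    return static_cast<T>(a + ((b - a) & mask));
}
```

Working in the promoted type means the 8-bit and 16-bit cases cannot overflow on platforms where int is wider than they are.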

EDIT: IIRC, signed right shifts don't have to be arithmetic. It is common, and certainly the case with every compiler I've used, but I believe the standard leaves it up to the compiler whether right shifts of signed types are arithmetic. In my copy of the draft standard, the following is written:

The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

But as I said, it will work on every compiler I've seen :-p.
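
If you want the build to fail loudly on a compiler that does something else, a compile-time check is cheap. A small sketch (sign_mask is a name made up for this example); note that since C++20 the standard itself requires the arithmetic behavior, so the check only matters for older language modes:

```cpp
#include <climits>

// Stops compilation on an implementation where >> on a negative value
// is not an arithmetic (sign-preserving) shift.
static_assert((-1 >> 1) == -1,
              "signed right shift is not arithmetic on this compiler");

// Hypothetical helper: all ones for negative x, zero otherwise.
inline int sign_mask(int x)
{
    return x >> (static_cast<int>(sizeof(int) * CHAR_BIT) - 1);
}
```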

My mind shudders to imagine what might lay in the heart of the compiler implementor who chooses not to preserve sign. — Crashworks, Feb 05 '09 at 04:23
+1 for mentioning CHAR_BIT and the implementation-definedness of signed right shifts (both news to me), but note that automatic template type deduction cannot deduce T for a type such as "numeric_traits::signed_type" -- you'll need to use enable_if for this instead. (As mentioned by grepsedawk.) — j_random_hacker, Feb 05 '09 at 12:02
@j_random_hacker: I don't see why it wouldn't work if you did: int x = imax(5, 4); no need for enable_if — Evan Teran, Feb 05 '09 at 17:18
I don't see how Boost is better than the pure-C++ equivalent. — S.S. Anne, Feb 28 '20 at 18:51
@S.S.Anne My answer is from before C++11, and therefore before there was a "pure C++ equivalent" — Evan Teran, Mar 02 '20 at 03:01

score 4 · Answer 2 · answered Nov 29 '12 at 02:12

Here's another approach for branchless max and min. What's nice about it is that it doesn't use any bit tricks and you don't have to know anything about the type.

template <typename T> 
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

template <typename T> 
inline T imin (T a, T b)
{
    return (a > b) * b + (a <= b) * a;
}

Ambroz Bizjak


Unfortunately on the PowerPC, integer multiplication is a microcoded operation that stops the pipeline dead, and is even slower than a mispredicted branch. – Crashworks Nov 29 '12 at 05:44
@Crashworks I tried this in some program on x86_64, and it was indeed slower than the usual branch approach. – Ambroz Bizjak Nov 29 '12 at 12:16
What about `(-(a<=b) & a) | (-(b<=a) & b)` ? – Todd Lehman Jun 06 '15 at 20:02
@AmbrozBizjak It depends on your test code. Did you have a condition that was always true or always false? For a fair test, the condition should be about 50% true and 50% false, and not all in a row. Should be randomly distributed. – jjxtra Nov 25 '16 at 16:51
@jjxtra I tried it with randomly generated data and both methods performed roughly the same. I also tried finding the minimum of an array by repeatedly calling min on the current smallest and the next array value. The branched version performed much faster, presumably because the branch prediction can just predict "not smaller" with very high rate of success in that case. – Tomas Wilson Jun 07 '23 at 21:35

score 4 · Answer 3 · answered Nov 18 '18 at 01:08

tl;dr

To achieve your goals, you're best off just writing this:

template<typename T> T max(T a, T b) { return (a > b) ? a : b; }

Long version

I implemented both the "naive" version of max() and your branchless version. Neither was templated; I used int32 to keep things simple. As far as I can tell, Visual Studio 2017 not only made the naive implementation branchless, it also produced fewer instructions for it.

Here is the relevant Godbolt (and please, check the implementation to make sure I did it right). Note that I'm compiling with /O2 optimizations.

Admittedly, my assembly-fu isn't all that great. NaiveMax() had 5 fewer instructions and no apparent branching (with inlining, I'm honestly not sure what's happening), so I wanted to run a test case to show definitively whether the naive implementation was faster.

So I built a test. Here's the code I ran. Visual Studio 2017 (15.8.7) with "default" Release compiler options.

#include <chrono>
#include <cstdio>    // getchar
#include <cstdlib>   // rand
#include <iostream>

using int32 = long;
using uint32 = unsigned long;

constexpr int32 NaiveMax(int32 a, int32 b)
{
    return (a > b) ? a : b;
}

constexpr int32 FastMax(int32 a, int32 b)
{
    int32 mask = a - b;
    mask = mask >> ((sizeof(int32) * 8) - 1);
    return a + ((b - a) & mask);
}

int main()
{
    int32 resInts[1000] = {};

    int32 lotsOfInts[1'000];
    for (uint32 i = 0; i < 1000; i++)
    {
        lotsOfInts[i] = rand();
    }

    auto naiveTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = NaiveMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    auto fastTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = FastMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    std::cout << "Naive Time: " << naiveTime << std::endl;
    std::cout << "Fast Time:  " << fastTime << std::endl;

    getchar();

    return 0;
}

And here's the output I get on my machine:

Naive Time: 2330174
Fast Time:  2492246

I've run it several times getting similar results. Just to be safe, I also changed the order in which I conduct the tests, just in case it's the result of a core ramping up in speed, skewing the results. In all cases, I get similar results to the above.

Of course, depending on your compiler or platform, these numbers may all be different. It's worth testing yourself.

The Answer

In brief, it would seem that the best way to write a branchless templated max() function is probably to keep it simple:

template<typename T> T max(T a, T b) { return (a > b) ? a : b; }

There are additional upsides to the naive method:

It works for unsigned types.
It even works for floating types.
It expresses exactly what you intend, rather than needing to comment up your code describing what the bit-twiddling is doing.
It is a well-known and recognizable pattern, so most compilers will know exactly how to optimize it, making it more portable. (This is a gut hunch of mine, backed up only by personal experience of compilers surprising me a lot. I'm willing to admit I'm wrong here.)

I think this benchmark is a bit skewed, because it is entirely possible that the compiler just decides to do nothing, since you never actually query the results. On my machine (G++ 12.2 mingw64) the code runs in ~100ns until I add a routine that sums the resInts arrays and prints the sum. (Then it runs in 2085700ns/1606900 ns, respectively) Clearly this was not the case for your tests, but to make sure the compiler does no shenanigans in that regard, one should always query the results. — Tomas Wilson, Jun 07 '23 at 21:46
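
One way to "query the results" as the comment suggests is to have the timed loop return a checksum and print it afterwards. A rough sketch, with TimedRun, time_max and naive_max as illustrative names that are not part of the benchmark above:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: the elapsed time plus a value derived from every result.
struct TimedRun
{
    long long nanoseconds;   // elapsed time of the loop
    std::int64_t checksum;   // sum of all results, so the work has an observable effect
};

// Times max_fn over adjacent pairs of data and accumulates the results.
template <typename MaxFn>
TimedRun time_max(const std::vector<std::int32_t>& data, MaxFn max_fn)
{
    const auto start = std::chrono::steady_clock::now();
    std::int64_t checksum = 0;
    for (std::size_t i = 1; i < data.size(); ++i)
    {
        checksum += max_fn(data[i - 1], data[i]);
    }
    const auto finish = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start);
    return TimedRun{ static_cast<long long>(elapsed.count()), checksum };
}

inline std::int32_t naive_max(std::int32_t a, std::int32_t b)
{
    return (a > b) ? a : b;
}
```

Printing the checksum of each run also doubles as a sanity check: two max implementations fed the same data should report the same sum.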

score 2 · Answer 4 · answered Feb 05 '09 at 03:52

You may want to look at the Boost.TypeTraits library. For detecting whether a type is signed you can use the is_signed trait. You can also look into enable_if/disable_if for removing overloads for certain types.

grepsedawk


score 0 · Answer 5 · answered May 15 '19 at 05:49

I don't know the exact conditions under which this bit-mask trick works, but you can do something like

#include<type_traits>

template<typename T, typename = std::enable_if_t<std::is_integral<T>{}> > 
inline T imax( T a, T b )
{
   ...
}

Other useful candidates are std::is_[un]signed, std::is_fundamental, etc. https://en.cppreference.com/w/cpp/types
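
As a fuller illustration of the constraint, here is a sketch that restricts the function to signed integer types and checks at compile time that other types are rejected. The names signed_imax and has_signed_imax are made up, the body assumes an arithmetic right shift, and the usual overflow caveat for extreme int or long inputs still applies.

```cpp
#include <climits>
#include <type_traits>
#include <utility>

// Hypothetical name: only participates in overload resolution for signed integers.
template <typename T,
          typename = typename std::enable_if<std::is_integral<T>::value &&
                                             std::is_signed<T>::value>::type>
inline T signed_imax(T a, T b)
{
    // diff has the promoted type (int for char and short).
    const auto diff = a - b;
    const int shift = static_cast<int>(sizeof(diff) * CHAR_BIT) - 1;
    // diff >> shift is all ones when a < b, zero otherwise.
    return static_cast<T>(a + ((b - a) & (diff >> shift)));
}

// Hypothetical detection helper: true when signed_imax(T, T) is callable.
template <typename T, typename = void>
struct has_signed_imax : std::false_type {};

template <typename T>
struct has_signed_imax<
    T, decltype(void(signed_imax(std::declval<T>(), std::declval<T>())))>
    : std::true_type {};

static_assert(has_signed_imax<int>::value, "int should be accepted");
static_assert(!has_signed_imax<unsigned>::value, "unsigned should be rejected");
static_assert(!has_signed_imax<float>::value, "float should be rejected");
```

A call such as signed_imax(1.5f, 2.5f) then fails to compile with a "no matching function" error instead of silently doing the wrong thing.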

score 0 · Answer 6 · answered Apr 01 '21 at 20:03

In addition to the "tl;dr" in tloch14's answer, one can also index into an array. This avoids the unwieldy bit shuffling of the "branchless min/max", and it generalizes to all types.

template<typename T> constexpr T OtherFastMax(const T &a, const T &b)
{
    const T (&p)[2] = {a, b};
    return p[a>b];
}
