5

So I'm trying to optimize some code. I have a function with a variable-sized loop. However, for efficiency's sake I want to make the cases with loops of size 1, 2 and 3 special cases that are completely unrolled. My approach so far is to declare the loop size as a const parameter, then define wrapper functions that call the main function, handing it a literal for the const value. I've included a code snippet illustrating the kind of thing I have in mind.

inline void someFunction (const int a)
{
    for (int i=0; i<a; i++)
    {
        // do something with i.
    }
}

void specialCase()
{
    someFunction (3);
}

void generalCase(int a)
{
    someFunction (a);
}

So my question is: is it reasonable for me to expect my compiler (GCC) to unroll the for loop inside of specialCase? Obviously I can copy-paste the contents of someFunction into specialCase and replace a with 3, but I'd rather deal with only one definition of someFunction in my code, for clarity's sake.

p clark
  • 2
    There's [Godbolt](https://godbolt.org/) for this, test it out yourself. – Hatted Rooster Sep 18 '17 at 10:35
  • 4
    Will you really benefit from unrolling 1, 2, 3 sized short loops instead of trying to optimize long loops? – user7860670 Sep 18 '17 at 10:50
  • If you don't want to copy-paste, why not make `do_something_with(i)` a separate (inline) function and let the compiler do the copy-pasting for `do_something_with(1); do_something_with(2);`? – Bo Persson Sep 18 '17 at 11:05
  • 1
    Yes, because in my real-world example it's not just one loop in the function, it's several, some of them nested 3 or 4 loops deep. This function will be called over and over; it's the main bottleneck of the program. The previous version of the program had manually unrolled all the loops and was unreadable (and only supported 3 passes through that one class of loop; I want to support 1, 2 or 3 at least). – p clark Sep 18 '17 at 15:10
  • Sadly, Godbolt suggests the loops will not be unrolled. – p clark Sep 18 '17 at 15:13

4 Answers

3

> However for efficiency sake I want to make cases with 1, 2 and 3 sized loops special cases that are completely unrolled.

Have you measured that this is actually faster? I doubt it will be, and I also doubt that the compiler won't already unroll the loop automatically.


> My approach so far is to declare the loop size as a const parameter then define wrapper functions that call the main function handing it a literal for the const value.

`const` doesn't mean anything here. It won't affect the compiler's ability to unroll the loop. It just means that `a` cannot be mutated inside the function body, but it's still a runtime argument.
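A minimal sketch of the difference, assuming C++11 or later. The names `body`, `someFunctionFixed`, `someFunctionGeneral` and the index-recording loop body are hypothetical, chosen only for illustration. A template parameter, unlike a `const int` argument, is a compile-time constant inside each instantiation, which gives the optimiser a fixed trip count to work with. It still does not guarantee that the loop gets unrolled.

```cpp
#include <vector>

// Hypothetical loop body: records each index it was called with.
inline void body(int i, std::vector<int>& out) { out.push_back(i); }

// Trip count as a template parameter: N is a compile-time constant
// inside every instantiation, unlike a `const int` function argument.
template <int N>
void someFunctionFixed(std::vector<int>& out)
{
    for (int i = 0; i < N; ++i)
        body(i, out);
}

// Runtime trip count, for the general case.
inline void someFunctionGeneral(int a, std::vector<int>& out)
{
    for (int i = 0; i < a; ++i)
        body(i, out);
}

// Dispatch the common small sizes to the fixed-size instantiations.
inline void someFunction(int a, std::vector<int>& out)
{
    switch (a) {
        case 1:  someFunctionFixed<1>(out); break;
        case 2:  someFunctionFixed<2>(out); break;
        case 3:  someFunctionFixed<3>(out); break;
        default: someFunctionGeneral(a, out); break;
    }
}
```

The loop is still written once per variant rather than once per size, and whether each instantiation is actually unrolled can be checked in the generated assembly.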


If you want to ensure unrolling, then force it. It's quite easy with C++17.

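#include <cstddef>      // std::size_t
#include <type_traits>  // std::integral_constant
#include <utility>      // std::index_sequence, std::forward

template <typename F, std::size_t... Is>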
void repeat_unrolled_impl(F&& f, std::index_sequence<Is...>)
{
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t Iterations, typename F>
void repeat_unrolled(F&& f)
{
    repeat_unrolled_impl(std::forward<F>(f), 
                         std::make_index_sequence<Iterations>{});
}

live example on godbolt
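As a usage sketch (C++17 assumed), the helper can be called with a generic lambda. The function `sum_first` and the four-element array are hypothetical, made up for illustration; the two helper templates are repeated only so the snippet compiles on its own. Because each call receives a distinct `std::integral_constant`, the index is usable as a compile-time constant inside the lambda.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

template <typename F, std::size_t... Is>
void repeat_unrolled_impl(F&& f, std::index_sequence<Is...>)
{
    // Fold over the comma operator: expands to f(0), f(1), ..., f(N-1).
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t Iterations, typename F>
void repeat_unrolled(F&& f)
{
    repeat_unrolled_impl(std::forward<F>(f),
                         std::make_index_sequence<Iterations>{});
}

// Hypothetical caller: sums the first N entries of a 4-element array.
template <std::size_t N>
int sum_first(const std::array<int, 4>& data)
{
    static_assert(N <= 4, "N must not exceed the array size");
    int total = 0;
    repeat_unrolled<N>([&](auto i) {
        // decltype(i)::value is a compile-time constant, so it can be
        // used as a template argument to std::get.
        total += std::get<decltype(i)::value>(data);
    });
    return total;
}
```

For example, `sum_first<3>` on an array holding 1, 2, 3, 4 adds the first three elements with no loop counter left in the source.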

Vittorio Romeo
2

If you don't like templates and don't trust your compiler, there's always this method, which is inspired by the outdated technique for manually unrolling loops known as "Duff's device":

void do_something(int i);

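void do_something_n_times(int n)
{
    // Without this guard, n <= 0 would hit `default`, skip the while
    // loop and fall through all three cases.
    if (n <= 0) return;

    int i = 0;
    switch(n)
    {
        default:
            // peel off iterations until exactly three remain
            while(n > 3) {
                do_something(i++);
                --n;
            }
            // deliberate fall-through from here on
        case 3: do_something(i++);
        case 2: do_something(i++);
        case 1: do_something(i++);
    }
}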

But I think it's worth saying that if you don't trust your compiler to do something so simple as loop unrolling for you, it's probably time to consider a new compiler.

Note that Duff's device was originally invented as a micro-optimisation strategy for programs compiled with compilers that did not automatically apply loop-unrolling optimisations.

It was invented by Tom Duff in 1983.

https://en.wikipedia.org/wiki/Duff%27s_device

Its use with modern compilers is questionable.

Richard Hodges
  • putting bits of the body of the main loop in inline functions is my last resort. Specifically inline because there would be too many parameters to pass otherwise. – p clark Sep 18 '17 at 15:22
  • @pclark I don't know your exact use case but don't be afraid to delegate parts of your function to lambdas, with variable capture, arguments or both. The compiler's optimiser should make short work of all that. – Richard Hodges Sep 18 '17 at 15:25
  • Lambda calculus is not something I'm too familiar with, much less its implementation in C++. My knowledge of C++11 and up is very shaky, hence I thought it best to ask here for some advice. My application is an ODE system solver for a class of ODEs of which the 1st, 2nd, and 3rd cases will be the most common. – p clark Sep 18 '17 at 15:35
  • @pclark delegating to a function or a lambda, if they're defined in the same file, is going to be just as optimisable, since the compiler can see all the code paths. Structuring your code into smaller functions can of course make code easier to follow. But you already knew that :) It's worth reiterating that C++ compilers have become very good at discerning your intent. The final code will take shortcuts that may surprise you, so don't worry too much about hand-optimisation. Often that gets in the way of the compiler. – Richard Hodges Sep 18 '17 at 15:38
  • I deleted my comments now, as they are not relevant anymore :-) – cmaster - reinstate monica Sep 19 '17 at 11:01
1

I'd rather go this way, if you're willing to use the force-inline (non-standard) feature of all popular compilers:

__attribute__((always_inline)) inline
void bodyOfLoop(int i) {
  // put code here
}

void specialCase() {
    bodyOfLoop(0);
    bodyOfLoop(1);
    bodyOfLoop(2);
}

void generalCase(int a) {
    for (int i=0; i<a; i++) {
        bodyOfLoop(i);
    }
}

Note: this is a GCC/Clang solution. Use __forceinline for MSVC.
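One way to cover both compiler families is a small macro, sketched below. The macro name `FORCE_INLINE` is hypothetical, and the squaring body is a placeholder so the functions return something observable. Only `__forceinline` and `__attribute__((always_inline))` are the compilers' own spellings.

```cpp
// Hypothetical wrapper macro around the compiler-specific spellings.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // fallback: ordinary inline hint only
#endif

// Placeholder loop body; the real code would go here.
FORCE_INLINE int bodyOfLoop(int i) { return i * i; }

// Manually unrolled special case for three iterations.
inline int specialCase()
{
    return bodyOfLoop(0) + bodyOfLoop(1) + bodyOfLoop(2);
}

// General case with a runtime trip count.
inline int generalCase(int a)
{
    int sum = 0;
    for (int i = 0; i < a; ++i)
        sum += bodyOfLoop(i);
    return sum;
}
```

On compilers matching neither branch the macro degrades to plain `inline`, so the code still builds but inlining is left to the optimiser.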

geza
0

How about these C++20 unrolling helpers:

#pragma once
#include <utility>
#include <concepts>
#include <iterator>

template<size_t N, typename Fn>
    requires (N >= 1) && requires( Fn fn, size_t i ) { { fn( i ) } -> std::same_as<void>; }
inline
void unroll( Fn fn )
{
    auto unroll_n = [&]<size_t ... Indices>( std::index_sequence<Indices ...> )
    {
        (fn( Indices ), ...);
    };
    unroll_n( std::make_index_sequence<N>() );
}

template<size_t N, typename Fn>
    requires (N >= 1) && requires( Fn fn ) { { fn() } -> std::same_as<void>; }
inline
void unroll( Fn fn )
{
    auto unroll_n = [&]<size_t ... Indices>( std::index_sequence<Indices ...> )
    {
        return ((Indices, fn()), ...);
    };
    unroll_n( std::make_index_sequence<N>() );
}

template<size_t N, typename Fn>
    requires (N >= 1) && requires( Fn fn, size_t i ) { { fn( i ) } -> std::convertible_to<bool>; }
inline
bool unroll( Fn fn )
{
    auto unroll_n = [&]<size_t ... Indices>( std::index_sequence<Indices ...> ) -> bool
    {
        return (fn( Indices ) && ...);
    };
    return unroll_n( std::make_index_sequence<N>() );
}

template<size_t N, typename Fn>
    requires (N >= 1) && requires( Fn fn ) { { fn() } -> std::convertible_to<bool>; }
inline
bool unroll( Fn fn )
{
    auto unroll_n = [&]<size_t ... Indices>( std::index_sequence<Indices ...> ) -> bool
    {
        return ((Indices, fn()) && ...);
    };
    return unroll_n( std::make_index_sequence<N>() );
}

template<std::size_t N, typename RandomIt, typename UnaryFunction>
    requires std::random_access_iterator<RandomIt>
    && requires( UnaryFunction fn, typename std::iterator_traits<RandomIt>::value_type elem ) { { fn( elem ) }; }
inline
RandomIt unroll_for_each( RandomIt begin, RandomIt end, UnaryFunction fn )
{
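    RandomIt &it = begin;
    if constexpr( N > 1 )
        // compare distances instead of forming it + N, which could point past end
        for( ; end - it >= std::iter_difference_t<RandomIt>( N ); it += N )
            unroll<N>( [&]( size_t i ) { fn( it[i] ); } );
    // handle the leftover elements one at a time
    for( ; it < end; ++it )
        fn( *it );
    return it;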
}

But be aware that the unrolling factor is crucial here. Unrolling is not always beneficial, and unrolling beyond the optimal CPU-specific factor can drop performance back to that of the loop without unrolling.
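A rough sketch of how different factors could be compared, assuming C++17. All names here (`unroll_sketch`, `sum_unrolled`, `time_sum_us`) are hypothetical, and `unroll_sketch` is a simplified stand-in for the helpers above rather than the same code. A single timed call is noisy, so a real measurement would repeat the call many times on realistic data and with the intended optimisation flags.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in: calls fn(0), fn(1), ..., fn(N-1).
template <typename Fn, std::size_t... Is>
void unroll_impl(Fn& fn, std::index_sequence<Is...>)
{
    (fn(Is), ...);
}

template <std::size_t N, typename Fn>
void unroll_sketch(Fn fn)
{
    unroll_impl(fn, std::make_index_sequence<N>{});
}

// Sums v in blocks of N, then handles the leftover elements one by one.
template <std::size_t N>
long long sum_unrolled(const std::vector<int>& v)
{
    static_assert(N >= 1, "unrolling factor must be at least 1");
    long long total = 0;
    std::size_t pos = 0;
    for (; v.size() - pos >= N; pos += N)
        unroll_sketch<N>([&](std::size_t i) { total += v[pos + i]; });
    for (; pos < v.size(); ++pos)
        total += v[pos];
    return total;
}

// Times one call with factor N; returns elapsed microseconds and
// stores the computed sum in `result`.
template <std::size_t N>
double time_sum_us(const std::vector<int>& v, long long& result)
{
    const auto start = std::chrono::steady_clock::now();
    result = sum_unrolled<N>(v);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Calling `time_sum_us<1>`, `time_sum_us<4>`, `time_sum_us<8>` and so on over the same input gives a first impression of where the benefit levels off on a given machine.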

Bonita Montero