Enumerating over a fold expression

Question

I have some auxiliary code that performs vector reshuffling using compile-time indices. It is of utmost importance that the generated code is as efficient as possible. I am relying on parameter packs with fold expressions, and I was wondering what the best practice is for writing such code.

A practical example: let there be a function insert that copies the elements of container y into container x at positions Ii, where the positions are compile-time constants. The basic signature of this function would be something like this:

template<size_t... Ii, size_t Xsize, size_t Ysize>
constexpr container<Xsize> insert(container<Xsize> x, container<Ysize> y);

And it's invoked like this: insert<0, 2>(x, y). I see two obvious ways of implementing this.

First: using an auxiliary index variable to iterate over y:

template<size_t... Ii, size_t Xsize, size_t Ysize>
constexpr container<Xsize> insert(container<Xsize> x, container<Ysize> y) {
  int i = 0;
  ((x[Ii] = y[i++]), ...);
  return x;
}

My problem with this solution is the variable i: I have to rely on the compiler to optimise it out.

The second solution avoids any runtime dependencies, but it requires an auxiliary function, making the entire implementation rather ugly:

template<size_t... Ii, size_t... Yi, size_t Xsize, size_t Ysize>
constexpr container<Xsize> insert_(container<Xsize> x, container<Ysize> y, std::index_sequence<Yi...>) {
  ((x[Ii] = y[Yi]), ...);
  return x;
}

template<size_t... Ii, size_t Xsize, size_t Ysize>
constexpr container<Xsize> insert(container<Xsize> x, container<Ysize> y) {
  return insert_<Ii...>(x, y, std::make_index_sequence<sizeof...(Ii)>{});
}

Is there a way to get this done avoiding both runtime variables and an auxiliary function?

`std::make_index_sequence<>` was basically designed for stuff like this. Even though it requires an auxiliary function, I like your second solution. You can’t really make use of the index sequence without another function to receive it. — John Drouhard, Jan 19 '19 at 16:21
Do you have any reason to think the compiler _won't_ optimize the runtime variable? It's admittedly awkward, but if it gets the job done...? — Barry, Jan 19 '19 at 16:58
E.g., at -O1 there's [nothing](https://godbolt.org/z/DKRdt9). — Barry, Jan 19 '19 at 17:09
@Barry, yes, the compilers are rather good at optimising it out, but it's still not perfect sometimes. The thing is, this is for SIMD processing, and most of these invocations should be compiled into one or two SIMD instructions anyway. I'm trying to work out some kinks that could stop the optimiser from doing the best job. — MrMobster, Jan 20 '19 at 11:12

score 2 · Accepted Answer · answered Jan 19 '19 at 18:17

> It is of utmost importance that the generated code is as efficient as possible.

Just a side note concerning your example: You should make sure that the performance does not suffer from passing function arguments by value. Same for the return value.

> Is there a way to get this done avoiding both runtime variables and an auxiliary function?

You can implement reusable helper functions. As an example, consider the following code.

static_assert(__cplusplus >= 201703L, "example written for C++17 or later");

#include <cstddef>

#include <array>
#include <type_traits>
#include <utility>

namespace detail {

template<std::size_t... inds, class F>
constexpr void gen_inds_impl(std::index_sequence<inds...>, F&& f) {
  f(std::integral_constant<std::size_t, inds>{}...);
}

}// detail

template<std::size_t N, class F>
constexpr void gen_inds(F&& f) {
  detail::gen_inds_impl(std::make_index_sequence<N>{}, std::forward<F>(f));
}

// the code above is reusable

template<
  std::size_t... inds_out,
  class T, std::size_t size_out, std::size_t size_in
>
constexpr std::array<T, size_out> insert1(
  std::array<T, size_out> out,
  std::array<T, size_in> in
) {
  static_assert((... && (inds_out < size_out)));
  static_assert(sizeof...(inds_out) <= size_in);

  gen_inds<sizeof...(inds_out)>([&] (auto... inds_in) {
    ((out[inds_out] = in[inds_in]), ...);
  });

  return out;
}
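One way to confirm that insert1 remains usable in constant expressions is to evaluate it at compile time. The sketch below condenses the helper from above (without the detail namespace) and adds a check; the names demo_x, demo_y and demo_r and their values are hypothetical and only serve as an illustration.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

// Condensed copy of the reusable helper shown above.
template<std::size_t... inds, class F>
constexpr void gen_inds_impl(std::index_sequence<inds...>, F&& f) {
  f(std::integral_constant<std::size_t, inds>{}...);
}

template<std::size_t N, class F>
constexpr void gen_inds(F&& f) {
  gen_inds_impl(std::make_index_sequence<N>{}, std::forward<F>(f));
}

template<std::size_t... inds_out, class T, std::size_t size_out, std::size_t size_in>
constexpr std::array<T, size_out> insert1(std::array<T, size_out> out,
                                          std::array<T, size_in> in) {
  static_assert((... && (inds_out < size_out)));
  static_assert(sizeof...(inds_out) <= size_in);
  gen_inds<sizeof...(inds_out)>([&](auto... inds_in) {
    ((out[inds_out] = in[inds_in]), ...);
  });
  return out;
}

// Hypothetical demo values: copy demo_y[0] to slot 0 and demo_y[1] to slot 2.
constexpr std::array<int, 4> demo_x{10, 11, 12, 13};
constexpr std::array<int, 2> demo_y{1, 2};
constexpr auto demo_r = insert1<0, 2>(demo_x, demo_y);

// Slots 0 and 2 are overwritten; slots 1 and 3 keep their old values.
static_assert(demo_r[0] == 1 && demo_r[1] == 11);
static_assert(demo_r[2] == 2 && demo_r[3] == 13);
```

Because the whole call is evaluated inside a static_assert, a failure here would show up as a compile error rather than as a runtime surprise.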

A similar alternative is the static_for approach:

static_assert(__cplusplus >= 201703L, "example written for C++17 or later");

#include <cstddef>

#include <array>
#include <type_traits>
#include <utility>

namespace detail {

template<std::size_t... inds, class F>
constexpr void static_for_impl(std::index_sequence<inds...>, F&& f) {
  (f(std::integral_constant<std::size_t, inds>{}), ...);
}

}// detail

template<std::size_t N, class F>
constexpr void static_for(F&& f) {
  detail::static_for_impl(std::make_index_sequence<N>{}, std::forward<F>(f));
}

// the code above is reusable

template<
  std::size_t... inds_out,
  class T, std::size_t size_out, std::size_t size_in
>
constexpr std::array<T, size_out> insert2(
  std::array<T, size_out> out,
  std::array<T, size_in> in
) {
  static_assert(sizeof...(inds_out) >= 1);

  static_assert((... && (inds_out < size_out)));
  static_assert(sizeof...(inds_out) <= size_in);

  constexpr std::size_t N = sizeof...(inds_out);

  static_for<N>([&] (auto n) {
    // note the constexpr
    constexpr std::size_t ind_out = std::array{inds_out...}[n];
    constexpr std::size_t ind_in = n;
    out[ind_out] = in[ind_in];
  });

  return out;
}
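If C++20 is available, a lambda with an explicit template parameter list can receive the index sequence directly, so no separately named helper is needed. This is a different technique from the two helpers above, shown only as a minimal sketch: insert3 is a hypothetical name, and the code assumes a compiler with C++20 lambda templates.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Sketch assuming C++20 (explicit template parameters on lambdas).
// insert3 is a hypothetical name chosen to follow insert1 and insert2.
template<std::size_t... inds_out, class T, std::size_t size_out, std::size_t size_in>
constexpr std::array<T, size_out> insert3(std::array<T, size_out> out,
                                          std::array<T, size_in> in) {
  static_assert((... && (inds_out < size_out)));
  static_assert(sizeof...(inds_out) <= size_in);

  // The lambda is invoked immediately; inds_in is deduced from the
  // std::index_sequence argument, so both packs expand in lockstep.
  [&]<std::size_t... inds_in>(std::index_sequence<inds_in...>) {
    ((out[inds_out] = in[inds_in]), ...);
  }(std::make_index_sequence<sizeof...(inds_out)>{});

  return out;
}
```

A call such as insert3<0, 2>(x, y) would then behave like insert1<0, 2>(x, y). The lambda is still a function object, so strictly speaking the auxiliary function is hidden rather than removed.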

I will look into the lambda-based approach; it could potentially solve some other design problems I am seeing. As for passing by value: all functions are explicitly declared as referentially transparent and will be forcibly inlined. The domain is SIMD programming, and the idea is that calls to these functions are replaced by one or two CPU instructions anyway. — MrMobster, Jan 20 '19 at 11:15

max66 · Answer 2 · 2019-01-19T16:50:20.443

I don't think it's possible to do this avoiding both runtime variables and an auxiliary function (hoping someone can disprove this).

I like your second solution very much, but what about using iterators for y (if y supports cbegin() and iterators, obviously)?

Something like this (caution: code not tested):

template <std::size_t... Ii, std::size_t Xsize, std::size_t Ysize>
constexpr container<Xsize> insert(container<Xsize> x, container<Ysize> const & y) {
   auto it = y.cbegin();
   ((x[Ii] = *it++), ...);
   return x;
}

It's almost your first solution, but accessing y by incrementing an iterator should be (I suppose, for sequential traversal of some containers) a little more efficient than using operator[]().

But I also suppose that, with a good optimizer, there isn't a perceptible difference.
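Since the snippet above is untested, here is a compilable variant of the same iterator idea. It is only a sketch: std::array stands in for the question's container type, and insert_it is a hypothetical name.

```cpp
#include <array>
#include <cstddef>

// insert_it is a hypothetical name; std::array is an assumed stand-in
// for the container type used in the question.
template<std::size_t... Ii, class T, std::size_t Xsize, std::size_t Ysize>
constexpr std::array<T, Xsize> insert_it(std::array<T, Xsize> x,
                                         std::array<T, Ysize> const& y) {
  static_assert((... && (Ii < Xsize)), "target index out of range");
  static_assert(sizeof...(Ii) <= Ysize, "not enough source elements");

  auto it = y.cbegin();
  // A fold over the comma operator evaluates its operands left to right,
  // so *it++ reads the elements of y in order.
  ((x[Ii] = *it++), ...);
  return x;
}
```

With std::array this works in constant expressions from C++17 on, so the result can be checked with a static_assert, for example on insert_it<1, 3>(x, y).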
