How to eliminate intermediate container for parallel std::transform_reduce()?

Question

Frequently, I have to compute Sum( f(i), 1, N ) or Product( f(i), 1, N ), where f(i) is CPU-intensive and the integer i runs over a sequential but huge range.

Using a C++20 compiler, I can write the following function:

uint64_t solution(uint64_t N)
{
    std::vector<uint64_t> v(N);
    std::iota(v.begin(), v.end(), 1ULL);

    return std::transform_reduce(
                std::execution::par, 
                v.cbegin(), v.cend(), 
                0ull, 
                std::plus<>(), 
                [](const uint64_t& i) -> uint64_t {
                   uint64_t result(0);
                   // expensive computation of result=f(i) goes here
                   // ...
                   return result;
                 });  

}

But that approach is RAM-constrained: the vector alone takes 8 * N bytes.

How can I completely eliminate the intermediate memory operations on the input vector at run time, using only the C++20 standard library (i.e. no vendor-specific or third-party libraries), and still get efficient parallel execution?

[`std::ranges::iota_view`](https://en.cppreference.com/w/cpp/ranges/iota_view)? — Jarod42, Nov 26 '20 at 13:13
@Jarod42 There's a problem that it uses a sentinel end iterator which doesn't work with the traditional iterator algorithms which require iterators of same type, and there appears to not be a `std::ranges::transform_reduce`. Maybe there's a need for a proposal to add those. — eerorika, Nov 26 '20 at 13:26
I haven't done so because I rather use the third party stuff, but shouldn't it be rather easy to implement a counting_iterator (that's how it is called in CUDA's Thrust library)? I mean it's just a wrapper around an integral type with the right interface, right? — paleonix, Nov 26 '20 at 14:18
@Jarod42 As I understand it, that missed C++20 (see [link](https://stackoverflow.com/a/61860160/14559358)), and, as mentioned, there are a number of open questions about the iterators. I am looking for a portable way to avoid the memory allocation altogether while keeping efficient parallel execution under the hood — Yuli Kolesnikov, Nov 26 '20 at 14:45
Yes, Boost also has one (probably first), but I have used the thrust one more, as the thrust iterators are very powerful when used together for parallel computing (e.g. nesting counting, zip and transform iterators). No idea if the Boost/Thrust counting iterators work with the C++17 execution policies, but I think that they should because there isn't much to them. — paleonix, Nov 26 '20 at 15:01

paleonix · Accepted Answer · 2020-11-29T17:24:19.907

Disclaimer: I have no prior experience in implementing iterators or in C++20.

This seems to work for me with gcc 10.1 and -std=c++2a. I put this together in a very short time without putting much thought into it, so the implementation can certainly be improved, if only by templatizing it. If operator<=> is exchanged for the old two-way comparison operators, this should also run with C++17, but I haven't tested it. If you find any errors or easily correctable design flaws, you are welcome to comment on them below so that this answer can be improved.

#include <cstddef>

#if __cplusplus > 201703L
#include <compare>
#endif

#include <execution>
#include <iostream>
#include <iterator>
#include <numeric>

class counting_iterator {
public:
  typedef std::ptrdiff_t difference_type;
  typedef std::ptrdiff_t value_type;
  typedef void pointer;
  typedef void reference;
  typedef std::random_access_iterator_tag iterator_category;

private:
  value_type val_{0};

public:
  counting_iterator() = default;
  explicit counting_iterator(value_type init) noexcept : val_{init} {}
  value_type operator*() const noexcept { return val_; }
  value_type operator[](difference_type index) const noexcept {
    return val_ + index;
  }
  counting_iterator &operator++() noexcept {
    ++val_;
    return *this;
  }
  counting_iterator operator++(int) noexcept {
    counting_iterator res{*this};
    ++(*this);
    return res;
  }
  counting_iterator &operator--() noexcept {
    --val_;
    return *this;
  }
  counting_iterator operator--(int) noexcept {
    counting_iterator res{*this};
    --(*this);
    return res;
  }
  friend counting_iterator operator+(counting_iterator const &it,
                                     difference_type const &offset) noexcept;
  friend counting_iterator operator+(difference_type const &offset,
                                     counting_iterator const &it) noexcept;
  friend counting_iterator operator-(counting_iterator const &it,
                                     difference_type const &offset) noexcept;
  friend difference_type operator-(counting_iterator const &a,
                                   counting_iterator const &b) noexcept;
  counting_iterator &operator+=(difference_type offset) noexcept {
    val_ += offset;
    return *this;
  }
  counting_iterator &operator-=(difference_type offset) noexcept {
    val_ -= offset;
    return *this;
  }
  friend bool operator==(counting_iterator const &a,
                         counting_iterator const &b) noexcept;
#if __cplusplus > 201703L
  friend std::strong_ordering operator<=>(counting_iterator const &a,
                                          counting_iterator const &b);
#else
  friend bool operator!=(counting_iterator const &a,
                         counting_iterator const &b) noexcept;
  friend bool operator<=(counting_iterator const &a,
                         counting_iterator const &b) noexcept;
  friend bool operator>=(counting_iterator const &a,
                         counting_iterator const &b) noexcept;
  friend bool operator<(counting_iterator const &a,
                        counting_iterator const &b) noexcept;
  friend bool operator>(counting_iterator const &a,
                        counting_iterator const &b) noexcept;
#endif
};

counting_iterator
operator+(counting_iterator const &it,
          counting_iterator::difference_type const &offset) noexcept {
  return counting_iterator{it.val_ + offset};
}
counting_iterator operator+(counting_iterator::difference_type const &offset,
                            counting_iterator const &it) noexcept {
  return counting_iterator{it.val_ + offset};
}
counting_iterator
operator-(counting_iterator const &it,
          counting_iterator::difference_type const &offset) noexcept {
  return counting_iterator{it.val_ - offset};
}
counting_iterator::difference_type
operator-(counting_iterator const &a, counting_iterator const &b) noexcept {
  return a.val_ - b.val_;
}
bool operator==(counting_iterator const &a,
                counting_iterator const &b) noexcept {
  return a.val_ == b.val_;
}
#if __cplusplus > 201703L
std::strong_ordering operator<=>(counting_iterator const &a,
                                 counting_iterator const &b) {
  return a.val_ <=> b.val_;
}
#else
bool operator!=(counting_iterator const &a,
                counting_iterator const &b) noexcept {
  return a.val_ != b.val_;
}
bool operator<=(counting_iterator const &a,
                counting_iterator const &b) noexcept {
  return a.val_ <= b.val_;
}
bool operator>=(counting_iterator const &a,
                counting_iterator const &b) noexcept {
  return a.val_ >= b.val_;
}
bool operator<(counting_iterator const &a,
               counting_iterator const &b) noexcept {
  return a.val_ < b.val_;
}
bool operator>(counting_iterator const &a,
               counting_iterator const &b) noexcept {
  return a.val_ > b.val_;
}
#endif

int main() {
    auto res = std::transform_reduce(
                std::execution::par, 
                counting_iterator(0), counting_iterator(10), 
                std::ptrdiff_t{0},  // same type as the values; plain long is only 32 bits on some platforms
                std::plus<>(), 
                [](const std::ptrdiff_t& i) { return i * i; });

    std::cout << res << std::endl;
}

EDIT: I reworked the class to make it usable with C++17 as well. It now also explicitly typedefs std::random_access_iterator_tag as its iterator category. I still don't get any parallel computing with that execution policy, neither with the iterator nor with the vector, so I don't know whether anything about the class itself inhibits parallel execution.
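The answer mentions templatizing as an obvious improvement. A minimal sketch of what that could look like follows. The names `counting_iterator_t` and `sum_over_range` are made up for this illustration, the comparisons are written in the C++17 style so no `<compare>` is needed, and the sketch assumes that all values and differences fit into `std::ptrdiff_t`. The sequential overload of `std::transform_reduce` is used so that the example builds without a parallel backend. Passing an execution policy as the first argument, as in the `main` above, should work the same way, but that is untested here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <numeric>

// Sketch only: counting iterator templated on an integer type Int.
// Assumption: every value and every difference fits into std::ptrdiff_t.
template <typename Int>
class counting_iterator_t {
public:
  using difference_type = std::ptrdiff_t;
  using value_type = Int;
  using pointer = void;
  using reference = Int;  // operator* returns by value, not by reference
  using iterator_category = std::random_access_iterator_tag;

  counting_iterator_t() = default;
  explicit counting_iterator_t(Int init) noexcept : val_{init} {}

  Int operator*() const noexcept { return val_; }
  Int operator[](difference_type n) const noexcept {
    return static_cast<Int>(val_ + static_cast<Int>(n));
  }

  counting_iterator_t &operator++() noexcept { ++val_; return *this; }
  counting_iterator_t operator++(int) noexcept { auto old = *this; ++val_; return old; }
  counting_iterator_t &operator--() noexcept { --val_; return *this; }
  counting_iterator_t operator--(int) noexcept { auto old = *this; --val_; return old; }

  counting_iterator_t &operator+=(difference_type n) noexcept {
    val_ = static_cast<Int>(val_ + static_cast<Int>(n));
    return *this;
  }
  counting_iterator_t &operator-=(difference_type n) noexcept {
    val_ = static_cast<Int>(val_ - static_cast<Int>(n));
    return *this;
  }

  // The friends take their operands by value, so they never modify the
  // caller's iterator.
  friend counting_iterator_t operator+(counting_iterator_t it, difference_type n) noexcept { return it += n; }
  friend counting_iterator_t operator+(difference_type n, counting_iterator_t it) noexcept { return it += n; }
  friend counting_iterator_t operator-(counting_iterator_t it, difference_type n) noexcept { return it -= n; }
  friend difference_type operator-(counting_iterator_t a, counting_iterator_t b) noexcept {
    return static_cast<difference_type>(a.val_) - static_cast<difference_type>(b.val_);
  }

  friend bool operator==(counting_iterator_t a, counting_iterator_t b) noexcept { return a.val_ == b.val_; }
  friend bool operator!=(counting_iterator_t a, counting_iterator_t b) noexcept { return a.val_ != b.val_; }
  friend bool operator<(counting_iterator_t a, counting_iterator_t b) noexcept { return a.val_ < b.val_; }
  friend bool operator>(counting_iterator_t a, counting_iterator_t b) noexcept { return a.val_ > b.val_; }
  friend bool operator<=(counting_iterator_t a, counting_iterator_t b) noexcept { return a.val_ <= b.val_; }
  friend bool operator>=(counting_iterator_t a, counting_iterator_t b) noexcept { return a.val_ >= b.val_; }

private:
  Int val_{0};
};

// Hypothetical helper: sum of f(i) for i in [first, last), with no container.
template <typename Int, typename F>
Int sum_over_range(Int first, Int last, F f) {
  assert(first <= last);
  return std::transform_reduce(counting_iterator_t<Int>{first},
                               counting_iterator_t<Int>{last}, Int{0},
                               std::plus<>{}, f);
}
```

For example, `sum_over_range<std::uint64_t>(1, 11, [](std::uint64_t i) { return i * i; })` adds the squares of 1 through 10 without allocating anything.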

Thank you! After some massaging and experiments, I can confirm that the implementation below works — Yuli Kolesnikov, Nov 27 '20 at 11:44

Yuli Kolesnikov · Answer 2 · 2020-11-30T19:10:56.243

After some massaging and experiments, I can confirm that a bidirectional iterator, based on the sample from Paul above, worked:

class counting_iterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using difference_type = std::ptrdiff_t;
    using value_type = std::ptrdiff_t;
private:
    value_type val_;
public:
    counting_iterator() : val_(0) {}
    explicit counting_iterator(value_type init) : val_(init) {}

    value_type operator*() noexcept { return val_; }
    const value_type& operator*() const noexcept { return val_; }

    counting_iterator& operator++() noexcept { ++val_; return *this; }
    counting_iterator operator++(int) noexcept { counting_iterator res{ *this }; ++(*this); return res; }

    counting_iterator& operator--() noexcept { --val_; return *this; }
    counting_iterator operator--(int) noexcept { counting_iterator res{ *this }; --(*this); return res; }

    value_type operator[](difference_type index) noexcept { return val_ + index; }

    counting_iterator& operator+=(difference_type offset) noexcept { val_ += offset; return *this; }
    counting_iterator& operator-=(difference_type offset) noexcept { val_ -= offset; return *this; }

    counting_iterator operator+(difference_type offset) const noexcept { return counting_iterator{ *this } += offset; };
    /*counting_iterator& operator+(difference_type offset) noexcept { return operator+=(offset); }*/

    counting_iterator operator-(difference_type offset) const noexcept { return counting_iterator{ *this } -= offset; };

    /*counting_iterator& operator-(difference_type offset) noexcept { return operator-=(offset); }*/

    difference_type operator-(counting_iterator const& other) const noexcept { return val_ - other.val_; }

    bool operator<(counting_iterator const& b) const noexcept { return val_ < b.val_; }
    bool operator==(counting_iterator const& b) const noexcept { return val_ == b.val_; }
    bool operator!=(counting_iterator const& b) const noexcept { return !operator==(b); }

    std::strong_ordering operator<=>(counting_iterator const& b) const noexcept { return val_ <=> b.val_; }
};

However, I could not make it work in parallel std::transform_reduce with iterator_category = std::random_access_iterator_tag, and I believe that is the reason for the performance drop.

UPD: In the code above, the commented-out lines made the MS compiler choose them over the copying alternatives, and that wreaked havoc during parallel execution when the iterator was marked with std::random_access_iterator_tag.
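To illustrate the update above, here is a small sketch of why those overloads are dangerous. The type `mutating_plus_iterator` and the function `demonstrate_overload_choice` are hypothetical names used only here. When both a const copying `operator+` and a non-const mutating `operator+` exist, ordinary overload resolution prefers the non-const one for a non-const object, so `it + 5` silently changes `it`. An algorithm that computes chunk boundaries as `first + k` would then corrupt its own starting iterator.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stripped-down iterator with both operator+ variants.
struct mutating_plus_iterator {
  std::ptrdiff_t val{0};

  // Copying version: leaves *this untouched.
  mutating_plus_iterator operator+(std::ptrdiff_t offset) const noexcept {
    return mutating_plus_iterator{val + offset};
  }
  // Mutating version, in the spirit of the commented-out lines: forwards to +=.
  mutating_plus_iterator &operator+(std::ptrdiff_t offset) noexcept {
    val += offset;
    return *this;
  }
};

inline void demonstrate_overload_choice() {
  mutating_plus_iterator it{0};
  mutating_plus_iterator next = it + 5;  // non-const object: mutating overload wins
  assert(next.val == 5);
  assert(it.val == 5);                   // the left operand was modified as a side effect

  const mutating_plus_iterator fixed{0};
  mutating_plus_iterator copy = fixed + 5;  // const object: only the copying overload is viable
  assert(copy.val == 5);
  assert(fixed.val == 0);
}
```

Removing the mutating overload (or never declaring it) leaves only the copying version, which is what iterator arithmetic is expected to do.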

edited Nov 30 '20 at 19:10

answered Nov 27 '20 at 12:00

Yuli Kolesnikov


Getting stuff to run in parallel with these execution policies seems to be a problem in general. Even the version using std::vector doesn't seem to use multithreading for me. For sequential execution and big N, the counting iterator is much faster, but already at N=2^24~17e6 the vector is actually (slightly) faster on my system (only measuring the transform_reduce, not the iota). I guess that this is just a sign of how well prefetching and caches work... – paleonix Nov 27 '20 at 15:26
For the time being, Thrust with OpenMP/TBB backend (you don't have to use a GPU) seems to be the way to go in terms of having a parallel STL. Although I don't have enough experience with using Thrust w/o a GPU to say if these backends are well-developed enough to be used in production code either. – paleonix Nov 27 '20 at 15:30
I reworked the code in my answer a bit, if you want to try it out. As I can't get any parallel performance even with the vector, I can't really test it for that. What is your reasoning behind using the `bidirectional_iterator_tag` instead? Was it not compiling with the random access one? Because mine does. A bidirectional iterator should not be enough for (efficient) parallel processing, as it makes it much harder to tile the input and distribute the work. – paleonix Nov 28 '20 at 19:33
@Paul Your limit (10) is most likely less than the number of logical processors, which is probably why you have not seen an improvement or got a wrong result. Try increasing it to 1000. First run with the std::execution::seq policy to establish a baseline (and the right answer), then change it to std::execution::par_unseq. You should see the difference. – Yuli Kolesnikov Nov 29 '20 at 12:46
@Paul For now I've stuck with `boost::counting_iterator`, but I am open to any improvement. Indeed, I have not managed to make your previous implementation work with the random_access category: it is much faster but gives me a wrong answer, probably because your iterator implementation is not really thread safe. The best I have had was with the bidirectional category (in reality, in this sample it is equivalent to forward). I am using MSVS2019 18.2, and its compiler cl.exe has version 19.28.29334 for x64. Dell XPS 15 9570, 12 logical cores, OS: Win 10 Pro 10.0.19042 – Yuli Kolesnikov Nov 29 '20 at 12:47
10 was only the test case, as I know the result there. I have tried much bigger numbers, especially for benchmarking. I don't think that this has to do with thread safety, because the iterator isn't more or less safe than a pointer. With a random access iterator, I would hope that the master thread just gives every other thread its own begin and end iterators by adding to the overall begin iterator (so everyone should be working on private copies instead of one shared object). With a bidirectional iterator, this isn't possible. – paleonix Nov 29 '20 at 16:18
With unseq I get a bit of speedup, but nothing crazy. The par version is pretty much as fast as seq and par_unseq is slower than unseq (same as unseq with vector). All of this uses N=2^24. The results are all equal, so for me the random access works. It just doesn't seem to use more than one of my 4 cores for some reason. – paleonix Nov 29 '20 at 17:23
@Paul, I found the problem in my published version of the iterator that was breaking execution with the random_access category. I've also managed to make your version compile with minimal modification. There is a significant improvement over the bidirectional iterator in parallel execution (at N=1000000 it is more than 4 times faster). – Yuli Kolesnikov Nov 30 '20 at 18:49
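
The tiling idea discussed in the comments (each worker gets its own begin and end, obtained by adding an offset to the overall begin) can be sketched as below. This is only an illustration under stated assumptions: `make_tiles` and `tiled_sum` are hypothetical names, plain integers stand in for random access iterators, and the tiles are processed one after another so that the sketch has no threading dependency. A parallel implementation would hand each tile to its own thread and combine the partial sums at the end.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using index_t = std::uint64_t;
using tile_t = std::pair<index_t, index_t>;  // half-open range [begin, end)

// Split [first, last) into at most `parts` contiguous tiles of nearly equal
// size. Each boundary is computed by plain addition, which is what random
// access makes cheap and what a bidirectional iterator cannot offer.
std::vector<tile_t> make_tiles(index_t first, index_t last, index_t parts) {
  assert(parts > 0 && first <= last);
  std::vector<tile_t> tiles;
  const index_t n = last - first;
  const index_t base = n / parts;   // minimum tile length
  const index_t extra = n % parts;  // the first `extra` tiles get one more element
  index_t begin = first;
  for (index_t p = 0; p < parts; ++p) {
    const index_t len = base + (p < extra ? 1 : 0);
    if (len == 0) break;  // more parts than elements: remaining tiles are empty
    tiles.emplace_back(begin, begin + len);
    begin += len;
  }
  return tiles;
}

// Sum f(i) over [first, last) tile by tile. Every tile keeps a private
// partial sum, so no state is shared while a tile is being processed.
template <typename F>
index_t tiled_sum(index_t first, index_t last, index_t parts, F f) {
  index_t total = 0;
  for (const tile_t &tile : make_tiles(first, last, parts)) {
    index_t partial = 0;
    for (index_t i = tile.first; i < tile.second; ++i) partial += f(i);
    total += partial;
  }
  return total;
}
```

Only the small list of tile boundaries is stored, never the N input values.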
