
I want to parallelize a function, and I have the problem that after a few hours my memory is exhausted.

The test program calculates something simple and works so far; the only issue is that memory usage constantly increases.

QT Project file:

QT -= gui
QT += concurrent widgets
CONFIG += c++11 console
CONFIG -= app_bundle
DEFINES += QT_DEPRECATED_WARNINGS
SOURCES += main.cpp

QT program file:

#include <QCoreApplication>
#include <qdebug.h>
#include <qtconcurrentrun.h>

double parallel_function(int instance){
    return (double)(instance)*10.0;
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    int nr_of_threads = 8;
    double result_sum,temp_var;
    for(qint32 i = 0; i<100000000; i++){
      QFuture<double> * future = new QFuture<double>[nr_of_threads];

      for(int thread = 0; thread < nr_of_threads; thread++){
        future[thread] = QtConcurrent::run(parallel_function,thread);
      }
      for(int thread = 0; thread < nr_of_threads; thread++){
        future[thread].waitForFinished();
        temp_var = future[thread].result();
        qDebug()<<"result: " << temp_var;
        result_sum += temp_var;
      }
  }

  qDebug()<<"total: "<<result_sum;
  return a.exec();
}

As I have observed, QtConcurrent::run(parallel_function, thread) allocates memory, but does not release it after future[thread].waitForFinished().

What's wrong here?

Kuba hasn't forgotten Monica
metty

2 Answers


You have a memory leak because the future array is never deleted. Add delete[] future at the end of the outer for loop.

for(qint32 i = 0; i<100000000; i++)
{
   QFuture<double> * future = new QFuture<double>[nr_of_threads];

   for(int thread = 0; thread < nr_of_threads; thread++){
     future[thread] = QtConcurrent::run(parallel_function,thread);
   }
   for(int thread = 0; thread < nr_of_threads; thread++){
      future[thread].waitForFinished();
      temp_var = future[thread].result();
      qDebug()<<"result: " << temp_var;
      result_sum += temp_var;
   }

   delete[] future; // <--
}
rafix07
    Nobody is supposed to be writing C++ code that uses raw new/delete anyway... this is pretty much bad advice. The way to avoid memory leaks is not to try harder to get manual memory management right. The way to avoid memory leaks is to write code where a memory leak is impossible by design. And that implies writing C++, not C-using-C++-syntax. – Kuba hasn't forgotten Monica Jun 12 '18 at 04:25

Here's how this might look - note how much simpler everything can be! You're dead set on doing manual memory management: why? First of all, QFuture is a value. You can store it very efficiently in any vector container that will manage the memory for you. You can iterate such a container using range-for. Etc.

QT = concurrent   # dependencies are automatic, you don't use widgets
CONFIG += c++14 console
CONFIG -= app_bundle
SOURCES = main.cpp

Even though the example is synthetic and the map_function is very simple, it's worth considering how to do things most efficiently and expressively. Your algorithm is a typical map-reduce operation, and blockingMappedReduced has half the overhead of manually doing all of the work.

First of all, let's recast the original problem in C++, instead of some C-with-pluses Frankenstein.

// https://github.com/KubaO/stackoverflown/tree/master/questions/future-ranges-49107082
/* QtConcurrent will include QtCore as well */
#include <QtConcurrent>
#include <algorithm>
#include <iterator>

using result_type = double;

static result_type map_function(int instance){
   return instance * result_type(10);
}

static void sum_modifier(result_type &result, result_type value) {
   result += value;
}

static result_type sum_function(result_type result, result_type value) {
   return result + value;
}

result_type sum_approach1(int const N) {
   QVector<QFuture<result_type>> futures(N);
   int id = 0;
   for (auto &future : futures)
      future = QtConcurrent::run(map_function, id++);
   return std::accumulate(futures.cbegin(), futures.cend(), result_type{}, sum_function);
}

There is no manual memory management, and no explicit splitting into "threads" - that was pointless, since the concurrent execution platform is aware of how many threads there are. So this is already better!

But this seems quite wasteful: each future internally allocates at least once (!).

Instead of using futures explicitly for each result, we can use the map-reduce framework. To generate the sequence, we can define an iterator that provides the integers we wish to work on. The iterator can be a forward or a bidirectional one, and its implementation is the bare minimum needed by QtConcurrent framework.

#include <iterator>

template <typename tag> class num_iterator : public std::iterator<tag, int, int, const int*, int> {
   int num = 0;
   using self = num_iterator;
   using base = std::iterator<tag, int, int, const int*, int>;
public:
   explicit num_iterator(int num = 0) : num(num) {}
   self &operator++() { num ++; return *this; }
   self &operator--() { num --; return *this; }
   self &operator+=(typename base::difference_type d) { num += d; return *this; }
   friend typename base::difference_type operator-(self lhs, self rhs) { return lhs.num - rhs.num; }
   bool operator==(self o) const { return num == o.num; }
   bool operator!=(self o) const { return !(*this == o); }
   typename base::reference operator*() const { return num; }
};

using num_f_iterator = num_iterator<std::forward_iterator_tag>;

result_type sum_approach2(int const N) {
   auto results = QtConcurrent::blockingMapped<QVector<result_type>>(num_f_iterator{0}, num_f_iterator{N}, map_function);
   return std::accumulate(results.cbegin(), results.cend(), result_type{}, sum_function);
}

using num_b_iterator = num_iterator<std::bidirectional_iterator_tag>;

result_type sum_approach3(int const N) {
   auto results = QtConcurrent::blockingMapped<QVector<result_type>>(num_b_iterator{0}, num_b_iterator{N}, map_function);
   return std::accumulate(results.cbegin(), results.cend(), result_type{}, sum_function);
}

Could we drop the std::accumulate and use blockingMappedReduced instead? Sure:

result_type sum_approach4(int const N) {
   return QtConcurrent::blockingMappedReduced(num_b_iterator{0}, num_b_iterator{N},
                                              map_function, sum_modifier);
}

We can also try a random access iterator:

using num_r_iterator = num_iterator<std::random_access_iterator_tag>;

result_type sum_approach5(int const N) {
   return QtConcurrent::blockingMappedReduced(num_r_iterator{0}, num_r_iterator{N},
                                              map_function, sum_modifier);
}

Finally, we can switch from using range-generating iterators, to a precomputed range:

#include <numeric>

result_type sum_approach6(int const N) {
   QVector<int> sequence(N);
   std::iota(sequence.begin(), sequence.end(), 0);
   return QtConcurrent::blockingMappedReduced(sequence, map_function, sum_modifier);
}

Of course, our point is to benchmark it all:

template <typename F> void benchmark(F fun, double const N) {
   QElapsedTimer timer;
   timer.start();
   auto result = fun(N);
   qDebug() << "sum:" << fixed << result << "took" << timer.elapsed()/N << "ms/item";
}

int main() {
   const int N = 1000000;
   benchmark(sum_approach1, N);
   benchmark(sum_approach2, N);
   benchmark(sum_approach3, N);
   benchmark(sum_approach4, N);
   benchmark(sum_approach5, N);
   benchmark(sum_approach6, N);
}

On my system, in release build, the output is:

sum: 4999995000000.000000 took 0.015778 ms/item
sum: 4999995000000.000000 took 0.003631 ms/item
sum: 4999995000000.000000 took 0.003610 ms/item
sum: 4999995000000.000000 took 0.005414 ms/item
sum: 4999995000000.000000 took 0.000011 ms/item
sum: 4999995000000.000000 took 0.000008 ms/item

Note how using map-reduce on a random-iterable sequence has over 3 orders of magnitude lower overhead than using QtConcurrent::run, and is 2 orders of magnitude faster than non-random-iterable solutions.

Kuba hasn't forgotten Monica