Why is the mclapply function in R more efficient than Rcpp + OpenMP?

Question

I have a function with a loop (EstimateUniques) that is parallelized with OpenMP. I assumed that multithreading would be more efficient than multiprocessing, but when I compared this function with a plain "mclapply" run, it performed worse: its run time grows as threads are added, while the mclapply time drops. What is the proper way to achieve the same level of parallelization in C++ as in R? Am I doing something wrong?

Performance comparison (time in seconds):

#Cores    CPP     R
   1    1.721s  1.538s
   2    1.945s  1.080s
   3    2.858s  0.801s

R code:

Rcpp::sourceCpp('ReproducibleExample.cpp')

arr <- 1:10000
n_rep <- 150
n_iters <- 200

EstimateUniquesR <- function(arr, n_iters, n_rep, cores) {
  parallel::mclapply(1:n_iters, function(i) 
    GetNumberOfUniqSamples(arr, i * 10, n_rep), mc.cores=cores)
}

cpp_times <- sapply(1:3, function(threads) 
  system.time(EstimateUniques(arr, n_iters, n_rep, threads))['elapsed'])
r_times <- sapply(1:3, function(cores) 
  system.time(EstimateUniquesR(arr, n_iters, n_rep, cores))['elapsed'])

data.frame(CPP=cpp_times, R=r_times)

ReproducibleExample.cpp file:

// [[Rcpp::plugins(openmp)]]
// [[Rcpp::plugins(cpp11)]]

#include <algorithm>
#include <cmath>     // std::round
#include <cstdlib>   // rand
#include <iterator>  // std::distance
#include <vector>
#include <omp.h>

// [[Rcpp::export]]
int GetNumberOfUniqSamples(const std::vector<int> &bs_array, int size, unsigned n_rep) {
  unsigned long sum = 0;
  for (unsigned i = 0; i < n_rep; ++i) {
    std::vector<int> uniq_vals(size);
    for (int try_num = 0; try_num < size; ++try_num) {
      uniq_vals[try_num] = bs_array[rand() % bs_array.size()];
    }
    std::sort(uniq_vals.begin(), uniq_vals.end());
    sum += std::distance(uniq_vals.begin(), std::unique(uniq_vals.begin(), uniq_vals.end()));
  }

  return std::round(double(sum) / n_rep);
}

// [[Rcpp::export]]
std::vector<int> EstimateUniques(const std::vector<int> &bs_array, const int n_iters, 
                                 const int n_rep = 1000, const int threads=1) {
  std::vector<int> uniq_counts(n_iters);

#pragma omp parallel for num_threads(threads) schedule(dynamic)
  for (int i = 0; i < n_iters; ++i) {
    uniq_counts[i] = GetNumberOfUniqSamples(bs_array, (i + 1) * 10, n_rep);
  }

  return uniq_counts;
}

I tried to use other types of scheduling in OpenMP, but they gave even worse results.
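
One thing worth ruling out, offered as a hypothesis rather than a confirmed diagnosis: `rand()` keeps hidden global state, and some C libraries (glibc among them) guard that state with a lock, so threads calling it in a tight loop can end up waiting on each other. The sketch below gives every loop iteration its own `std::mt19937` instead. The names `GetNumberOfUniqSamplesLocalRng` and `EstimateUniquesLocalRng` and the `seed` parameter are hypothetical and do not appear in the original code. Reusing one buffer across repetitions is also a change from the original.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Hypothetical variant of GetNumberOfUniqSamples: draws indices from a
// caller-supplied generator instead of rand(), so threads share no RNG state.
int GetNumberOfUniqSamplesLocalRng(const std::vector<int> &bs_array, int size,
                                   unsigned n_rep, std::mt19937 &rng) {
  assert(!bs_array.empty() && size > 0 && n_rep > 0);
  std::uniform_int_distribution<std::size_t> pick(0, bs_array.size() - 1);
  std::vector<int> uniq_vals(size);  // allocated once, refilled on every repetition
  unsigned long sum = 0;
  for (unsigned i = 0; i < n_rep; ++i) {
    for (int try_num = 0; try_num < size; ++try_num) {
      uniq_vals[try_num] = bs_array[pick(rng)];
    }
    std::sort(uniq_vals.begin(), uniq_vals.end());
    sum += std::distance(uniq_vals.begin(),
                         std::unique(uniq_vals.begin(), uniq_vals.end()));
  }
  return static_cast<int>(std::round(double(sum) / n_rep));
}

// Hypothetical variant of EstimateUniques. Each iteration seeds its own
// generator from the iteration index, so the output does not depend on how
// iterations are distributed over threads.
std::vector<int> EstimateUniquesLocalRng(const std::vector<int> &bs_array,
                                         int n_iters, unsigned n_rep = 1000,
                                         int threads = 1, unsigned seed = 42) {
  std::vector<int> uniq_counts(n_iters);

#pragma omp parallel for num_threads(threads) schedule(dynamic)
  for (int i = 0; i < n_iters; ++i) {
    std::mt19937 rng(seed + i);  // private to this iteration
    uniq_counts[i] =
        GetNumberOfUniqSamplesLocalRng(bs_array, (i + 1) * 10, n_rep, rng);
  }

  return uniq_counts;
}
```

To time this from R in the same way as `EstimateUniques`, the wrapper would need the `// [[Rcpp::export]]` attribute and the file the two `Rcpp::plugins` lines from the original. Without OpenMP enabled the pragma is ignored and the loop runs serially. If the timings still get worse as threads are added, the RNG is not the bottleneck and the cause lies elsewhere.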

I have a feeling that this is primarily the _cost_ associated with copying data from an R Object into a C++ object... Try creating the `arr` object in C++ — coatless, Jun 24 '17 at 18:26
Thank you for the answer, but if you were right, we would see the opposite situation: **EstimateUniquesR** transforms data from R to C++ many times, while **EstimateUniques** does it only once. — Viktor Petukhov, Jun 24 '17 at 20:27
respectfully good sir... Not the case. _R_ processes are _spun_ up. There is _never_ a transference of data requiring a **deep** copy unlike the _C++_ example. Consult `?parallel::mclapply` and `parallel::mclapply` source for details. I'll try to provide a better overview a bit later. — coatless, Jun 24 '17 at 20:44
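
The copy-cost hypothesis from the comments can be checked with a rough measurement. The sketch below times a plain deep copy of a `std::vector<int>`, used here only as a stand-in for the R-to-C++ conversion. It is not Rcpp's own conversion code, and `AverageCopySeconds` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: deep-copies `src` `n_copies` times and returns the
// average time per copy in seconds. `checksum` receives a value derived from
// the copies so the compiler cannot discard them.
double AverageCopySeconds(const std::vector<int> &src, int n_copies,
                          long long &checksum) {
  assert(!src.empty() && n_copies > 0);
  using Clock = std::chrono::steady_clock;
  checksum = 0;
  const auto start = Clock::now();
  for (int i = 0; i < n_copies; ++i) {
    std::vector<int> copy(src);  // one deep copy of the whole array
    checksum += copy[static_cast<std::size_t>(i) % copy.size()];
  }
  const std::chrono::duration<double> elapsed = Clock::now() - start;
  return elapsed.count() / n_copies;
}
```

Calling it with a vector of 10000 elements and `n_copies = 200` mirrors the sizes in the question (`arr <- 1:10000`, `n_iters <- 200`). Comparing the per-copy time against the run times in the table gives a rough idea of how much copying could contribute.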

0 Answers