
Let x be a vector and M a matrix.

In R, I can do

D <- diag(exp(x))
crossprod(M, D%M)

and in RcppArmadillo, I have the following which is much slower.

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat multiple_mnv(const arma::vec& x, const arma::mat& M) {
  arma::colvec diagonal(x.size())
  for (int i = 0; i < x.size(); i++)
  {
    diagonal(i) = exp(x[i]);
  }
  arma::mat D = diagmat(diagonal);
  return     M.t()*D*M;
}

Why is this so slow? How can I speed this up?

coatless
manju
  • Can you provide a [mre]? How did you benchmark it? – Ralf Stubner Nov 03 '19 at 07:58
  • Try changing `return M.t()*D*M` to `return M.t()*diagmat(diagonal)*M`. The latter expression gives Armadillo more information on what `diagonal` is, so it may optimize more. Note that the generation of `diagonal` is redundant, as this expression should also work: `return M.t()*diagmat(arma::exp(x))*M` – mtall Nov 05 '19 at 06:53

1 Answer


Welcome to Stack Overflow, manju. For future questions, please be advised that a minimal reproducible example is expected, and is in fact in your best interest to provide; it helps others help you. Here's an example of how you could provide example data for others to work with:

## Set seed for reproducibility
set.seed(123)
## Generate data
x <- rnorm(10)
M <- matrix(rnorm(100), nrow = 10, ncol = 10)
## Output code for others to copy your objects
dput(x)
dput(M)

This is the data I will work with to show that your C++ code is in fact not slower than R. I used your C++ code (adding in a missing semicolon):

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat foo(const arma::vec& x, const arma::mat& M) {
    arma::colvec diagonal(x.size());
    for ( int i = 0; i < x.size(); i++ )
    {
        diagonal(i) = exp(x[i]);
    }
    arma::mat D = diagmat(diagonal);
    return M.t() * D * M;
}

Note also that I had to make some of my own choices about the type of the return object and the types of the function arguments (this is one of the places where a minimal reproducible example could help you: what if these choices affect my results?). I then create an R function to do what foo() does:

bar <- function(v, M) {
    D <- diag(exp(v))
    return(crossprod(M, D %*% M))
}

Note also that I had to fix a typo you had, changing D%M to D %*% M. Let's double check they give the same results:

all.equal(foo(x, M), bar(x, M))
# [1] TRUE

Now let's explore how fast they are:

library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M), times = 1e5)
bench
# Unit: microseconds
#  expr    min     lq     mean median     uq      max
#   cpp 22.185 23.015 27.00436 23.204 23.461 31143.30
#     R 22.126 23.028 25.48256 23.216 23.475 29628.86

Those look pretty much the same to me! We can also look at a density plot of the times (throwing out the extreme value outliers to make things a little clearer):

cpp_times <- with(bench, time[expr == "cpp"])
R_times   <- with(bench, time[expr == "R"])
cpp_time_dens <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
R_time_dens   <- density(R_times[R_times < quantile(R_times, 0.95)])
plot(cpp_time_dens, col = "blue", xlab = "Time (in nanoseconds)", ylab = "",
     main = "Comparing C++ and R execution time")
lines(R_time_dens, col = "red")
legend("topright", col = c("blue", "red"), bty = "n", lty = 1,
       legend = c("C++ function (foo)", "R function (bar)"))

[Image: density plot comparing C++ (foo) and R (bar) execution times]

Why?

As helpfully pointed out by Dirk Eddelbuettel in the comments, in the end both R and Armadillo are going to call a LAPACK or BLAS routine anyway -- you shouldn't expect much difference unless you can give Armadillo a hint on how to be more efficient.

Can we make the Armadillo code faster?

Yes! As pointed out by mtall in the comments, we can give Armadillo the hint that we're dealing with a diagonal matrix. Let's try; we'll use the following code:

// [[Rcpp::export]]
arma::mat baz(const arma::vec& x, const arma::mat& M) {
    return M.t() * diagmat(arma::exp(x)) * M;
}

And benchmark it:

all.equal(foo(x, M), baz(x, M))
# [1] TRUE
library(microbenchmark)
bench <- microbenchmark(cpp = foo(x, M), R = bar(x, M),
                        cpp2 = baz(x, M), times = 1e5)
bench
# Unit: microseconds
#  expr    min     lq     mean median     uq      max
#   cpp 22.822 23.757 27.57015 24.118 24.632 26600.48
#     R 22.855 23.771 26.44725 24.124 24.638 30619.09
#  cpp2 20.035 21.218 25.49863 21.587 22.123 36745.72

We see a small but consistent improvement; let's take a look graphically as we did before:

cpp_times  <- with(bench, time[expr == "cpp"])
cpp2_times <- with(bench, time[expr == "cpp2"])
R_times    <- with(bench, time[expr == "R"])
cpp_time_dens  <- density(cpp_times[cpp_times < quantile(cpp_times, 0.95)])
cpp2_time_dens <- density(cpp2_times[cpp2_times < quantile(cpp2_times, 0.95)])
R_time_dens    <- density(R_times[R_times < quantile(R_times, 0.95)])
xlims <- range(c(cpp_time_dens$x, cpp2_time_dens$x, R_time_dens$x))
ylims <- range(c(cpp_time_dens$y, cpp2_time_dens$y, R_time_dens$y))
ylims <- ylims * c(1, 1.15)
cols  <- c("#e69f00", "#56b4e9", "#009e73")
labs  <- c("C++ original", "C++ improved", "R")
plot(cpp_time_dens, col = cols[1], xlim = xlims, ylim = ylims,
     xlab = "Time (in nanoseconds)", ylab = "",
     main = "Comparing C++ and R execution time")
lines(cpp2_time_dens, col = cols[2])
lines(R_time_dens, col = cols[3])
legend("topleft", col = cols, bty = "n", lty = 1, legend = labs, horiz = TRUE)

[Image: density plot comparing the original C++, improved C++, and R execution times]

duckmayr
  • Very nice and patient answer. The "tl;dr" is, of course, that it all ends up in the same LAPACK/BLAS subroutine, so we expect performance to be similar (modulo overhead in "how the data gets there"). – Dirk Eddelbuettel Nov 04 '19 at 20:38
  • Try changing `return M.t()*D*M` to `return M.t()*diagmat(diagonal)*M`. The latter expression gives Armadillo more information on what `diagonal` is, so it may optimize more. Note that the generation of `diagonal` is redundant, as this expression should also work: `return M.t()*diagmat(arma::exp(x))*M` – mtall Nov 05 '19 at 06:54
  • @mtall True. I will add benchmarks with efficient Armadillo code as well, but my main point here was to demonstrate that **given OP's code**, it actually wasn't slower – duckmayr Nov 05 '19 at 12:16
  • @mtall I added something on your mentioned improvement to the code; thanks! – duckmayr Nov 05 '19 at 12:28