In your Mode
function, since you are mostly calling sugar wrapper functions, you won't see that much improvement over base R
. In fact, simply writing a faithful base R translation, we have:
baseMode <- function(x, narm = FALSE) {
if (narm) x <- x[!is.na(x)]
ux <- unique(x)
ux[which.max(table(match(x, ux)))]
}
And benchmarking, we have:
set.seed(1234)
s <- sample(1e5, replace = TRUE)
library(microbenchmark)
microbenchmark(Mode(s), baseMode(s), times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Mode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
baseMode(s) 1.490765 1.645367 1.571132 1.616061 1.637181 1.448306 10
Typically, when we undertake the effort of writing our own compiled code, we would expect bigger gains. Simply wrapping these already efficient compiled functions in Rcpp
isn't going to magically get you the gains you expect. In fact, on larger examples the base solution is faster. Observe:
set.seed(1234)
sBig <- sample(1e6, replace = TRUE)
system.time(Mode(sBig))
user system elapsed
1.410 0.036 1.450
system.time(baseMode(sBig))
user system elapsed
0.915 0.025 0.943
To address your question of writing a faster mode function, we can make use of std::unordered_map
, which is very similar to table
underneath the hood (i.e. they are both hash tables at their heart). Additionally, since you are returning a single integer, we can safely assume that we can replace NumericVector
with IntegerVector
and also that you are not concerned with returning every value that occurs the most.
The algorithm below can be modified to return the true mode, but I will leave that as an exercise (hint: you will need std::vector
along with taking some sort of action when it->second == myMax
). N.B. you will also need to add // [[Rcpp::plugins(cpp11)]]
at the top of your cpp file for std::unordered_map
and auto
.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>
// [[Rcpp::export]]
int fastIntMode(IntegerVector x, bool narm = false) {
if (narm) x = x[!is_na(x)];
int myMax = 1;
int myMode = 0;
std::unordered_map<int, int> modeMap;
modeMap.reserve(x.size());
for (std::size_t i = 0, len = x.size(); i < len; ++i) {
auto it = modeMap.find(x[i]);
if (it != modeMap.end()) {
++(it->second);
if (it->second > myMax) {
myMax = it->second;
myMode = x[i];
}
} else {
modeMap.insert({x[i], 1});
}
}
return myMode;
}
And the benchmarks:
microbenchmark(Mode(s), baseMode(s), fastIntMode(s), times = 15, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Mode(s) 6.428343 6.268131 6.622914 6.134388 6.881746 7.78522 15
baseMode(s) 9.757491 9.404101 9.454857 9.169315 9.018938 10.16640 15
fastIntMode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 15
Now we are talking... about 6x faster than the original and 9x faster than base. They all return the same value:
fastIntMode(s)
##[1] 85433
baseMode(s)
##[1] 85433
Mode(s)
##[1] 85433
And for our larger example:
## base R returned in 0.943s
system.time(fastIntMode(s))
user system elapsed
0.217 0.006 0.224
In addition to the implicit CC-BY-SA
license I hereby license the code in this answer under the GPL >= 2
.