2

I'm trying to build a super fast mode function for R to use for aggregating large categorical datasets. The function should take vector input of all supported R types and return the mode. I have read This post, This Help-page and others, but I was not able to make the function take in all R data types. My code now works for numeric vectors, I am relying on Rcpp sugar wrapper functions:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int Mode(NumericVector x, bool narm = false) 
{
    if (narm) x = x[!is_na(x)];
    NumericVector ux = unique(x);
    int y = ux[which_max(table(match(x, ux)))];
    return y;
}

In addition I was wondering if the 'narm' argument can be renamed 'na.rm' without giving errors, and of course if there is a faster way to code a mode function in C++, I would be grateful to know about it.

Sebastian
  • 1,067
  • 7
  • 12
  • 3
    @JosephWood answered your performance question. What did you try concerning support for vector types of any type? – Ralf Stubner Mar 18 '19 at 11:31
  • I tried to use a template with a type macro based on [this post](https://stackoverflow.com/questions/19823915/how-can-i-handle-vectors-without-knowing-the-type-in-rcpp). With Joseph's code I would start like this `template Vector Mode_temp(Vector x, bool narm = false) { if (narm) x = x[!is_na(x)]; int myMax = 1; myMode = x[0]; std::unordered_map<, > modeMap; ` and then `// [[Rcpp::export]] SEXP Mode( SEXP x ){ RCPP_RETURN_VECTOR(Mode_temp, x) ; }`, but don't understand the syntax, and if it world work with unordered_map. – Sebastian Mar 18 '19 at 12:54
  • A better option to me seems the 'switch' syntax described on [this page](http://gallery.rcpp.org/articles/rcpp-wrap-and-recurse/) but here I also ran in trouble with vector input, scalar out put and handling the code internally (i.e. creating vectors and salars of a certain type in each switch case.) – Sebastian Mar 18 '19 at 12:56

3 Answers3

8

In order to make the function work for any vector input, you could implement @JosephWood's algorithm for any data type you want to support and call it from a switch(TYPEOF(x)). But that would be lots of code duplication. Instead, it is better to make a generic function that can work on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE>, then we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR. One tricky thing is that you need the storage type for Vector<RTYPE> in order to create a suitable std::unordered_map. Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.

Putting it all together:

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>

template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
  if (narm) x = x[!is_na(x)];
  int myMax = 1;
  Vector<RTYPE> myMode(1);
  // special case for factors == INTSXP with "class" and "levels" attribute
  if (x.hasAttribute("levels")){
    myMode.attr("class") = x.attr("class");
    myMode.attr("levels") = x.attr("levels");
  }
  std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
  modeMap.reserve(x.size());

  for (std::size_t i = 0, len = x.size(); i < len; ++i) {
    auto it = modeMap.find(x[i]);

    if (it != modeMap.end()) {
      ++(it->second);
      if (it->second > myMax) {
        myMax = it->second;
        myMode[0] = x[i];
      }
    } else {
      modeMap.insert({x[i], 1});
    }
  }

  return myMode;
}

template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
  stop("Not supported SEXP type!");
}

// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
  RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}

/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f) 
*/

Output:

> set.seed(1234)

> s <- sample(1e5, replace = TRUE)

> fastMode(s)
[1] 85433

> fastMode(s + 0.1)
[1] 85433.1

> l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 

> fastMode(l)
[1] TRUE

> c <- sample(letters, 1e5, replace = TRUE)

> fastMode(c)
[1] "z"

> f <- as.factor(c)

> fastMode(f) 
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

As noted above, the used algorithm comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.

Ralf Stubner
  • 26,263
  • 3
  • 40
  • 75
  • 1
    Thank you very much @RalfStubner !! I would have never managed this myself. Your function is about 25 x faster on large categorical data than the base solution, great! The only thing I would add to the code is default to the first value if the mode cannot be found, i.e. initialize ` Vector myMode(1); myMode = x[0];`. Thanks to all, especially also JosephWood I'm sure many will find this helpful. – Sebastian Mar 18 '19 at 15:25
  • 1
    @Sebastian You are welcome. Credit for the speed goes to Joseph Wood, though. – Ralf Stubner Mar 18 '19 at 15:26
  • @RalfStubner , excellent work as always! I actually thought about you as I wrote my answer as I know you are working on implementing a hash function for a sampling function in `dqrng`. I can’t wait to see the final result! – Joseph Wood Mar 18 '19 at 15:42
  • This is a really fast solution. How would we have to adopt the code to return the frequency of the mode as attribute "freq"? – Andri Signorell Oct 18 '19 at 16:52
  • 1
    @AndriSignorell You could add `myMode.attr("freq") = myMax;` before returning `myMode`. – Ralf Stubner Oct 18 '19 at 16:57
  • Works perfectly! Thanks, I will use your code for the function DescTools::Mode and refer to you in this post. Hope that'll be ok. – Andri Signorell Oct 18 '19 at 18:00
  • @AndriSignorell That is more complicated than it looks :-( The initial algorithm was contributed by Joseph Wood (unfortunately I cannot @-mention two people) and (implicitly) licensed under CC-BY-SA 3.0. I adapted the algorithm to use arbitrary input, still with the same (implicit) license. Recently SO changed the license for all content to CC-BY-SA 4.0. Whether or not they had the right do do so is a different question. Now DescTools is licensed under GPL >= 2. It would probably be best if Joseph and I would re-license the code. – Ralf Stubner Oct 18 '19 at 20:29
  • @JosephWood Andri wants to use our code in a GPL licensed package. That would be easiest if we were to re-license the code under GPL >= 2. I would be ok with that, but you would have to go first since my work is based on yours ;-) – Ralf Stubner Oct 18 '19 at 20:33
  • @RalfStubner , that sounds great! I’m not really sure how to proceed. I know I can license my code in a git repo. For what it’s worth, I will gladly follow your lead. – Joseph Wood Oct 18 '19 at 20:38
  • @JosephWood It should be sufficient to state the additional license her on SO. Preferably in an edit to your answer since comments might get removed/changed. Something like the following should do (IANAL!): In addition to the implicit CC-BY-SA license I hereby license the code in this answer under the [GPL](http://www.gnu.org/licenses/gpl.html) (version 2 or later). – Ralf Stubner Oct 18 '19 at 20:55
  • @RalfStubner, how is that? :) – Joseph Wood Oct 18 '19 at 21:00
  • 2
    Best then, thanks for clarifying the complex legal situation. So I will mention you both as authors with link to the post. – Andri Signorell Oct 18 '19 at 23:44
7

In your Mode function, since you are mostly calling sugar wrapper functions, you won't see that much improvement over base R. In fact, simply writing a faithful base R translation, we have:

baseMode <- function(x, narm = FALSE) {
    if (narm) x <- x[!is.na(x)]
    ux <- unique(x)
    ux[which.max(table(match(x, ux)))]
}

And benchmarking, we have:

set.seed(1234)
s <- sample(1e5, replace = TRUE)

library(microbenchmark)
microbenchmark(Mode(s), baseMode(s), times = 10, unit = "relative")
Unit: relative
       expr      min       lq     mean   median       uq      max neval
    Mode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
baseMode(s) 1.490765 1.645367 1.571132 1.616061 1.637181 1.448306    10

Typically, when we undertake the effort of writing our own compiled code, we would expect bigger gains. Simply wrapping these already efficient compiled functions in Rcpp isn't going to magically get you the gains you expect. In fact, on larger examples the base solution is faster. Observe:

set.seed(1234)
sBig <- sample(1e6, replace = TRUE)

system.time(Mode(sBig))
 user  system elapsed 
1.410   0.036   1.450 

system.time(baseMode(sBig))
 user  system elapsed 
0.915   0.025   0.943 

To address your question of writing a faster mode function, we can make use of std::unordered_map, which is very similar to table underneath the hood (i.e. they are both hash tables at their heart). Additionally, since you are returning a single integer, we can safely assume that we can replace NumericVector with IntegerVector and also that you are not concerned with returning every value that occurs the most.

The algorithm below can be modified to return the true mode, but I will leave that as an exercise (hint: you will need std::vector along with taking some sort of action when it->second == myMax). N.B. you will also need to add // [[Rcpp::plugins(cpp11)]] at the top of your cpp file for std::unordered_map and auto.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>

// [[Rcpp::export]]
int fastIntMode(IntegerVector x, bool narm = false) {
    if (narm) x = x[!is_na(x)];
    int myMax = 1;
    int myMode = 0;
    std::unordered_map<int, int> modeMap;
    modeMap.reserve(x.size());

    for (std::size_t i = 0, len = x.size(); i < len; ++i) {
        auto it = modeMap.find(x[i]);

        if (it != modeMap.end()) {
            ++(it->second);
            if (it->second > myMax) {
                myMax = it->second;
                myMode = x[i];
            }
        } else {
            modeMap.insert({x[i], 1});
        }
    }

    return myMode;
}

And the benchmarks:

microbenchmark(Mode(s), baseMode(s), fastIntMode(s), times = 15, unit = "relative")
Unit: relative
          expr      min       lq     mean   median       uq      max neval
       Mode(s) 6.428343 6.268131 6.622914 6.134388 6.881746  7.78522    15
   baseMode(s) 9.757491 9.404101 9.454857 9.169315 9.018938 10.16640    15
fastIntMode(s) 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000    15

Now we are talking... about 6x faster than the original and 9x faster than base. They all return the same value:

fastIntMode(s)
##[1] 85433

baseMode(s)
##[1] 85433

Mode(s)
##[1] 85433

And for our larger example:

## base R returned in 0.943s
system.time(fastIntMode(s))
 user  system elapsed 
0.217   0.006   0.224

In addition to the implicit CC-BY-SA license I hereby license the code in this answer under the GPL >= 2.

Joseph Wood
  • 7,077
  • 2
  • 30
  • 65
  • 1
    Is it also recommended to use `boost::unordered_map` instead of `std::unordered_map` for more speed improvement. – F. Privé Mar 18 '19 at 05:55
  • Thanks @F.Privé , I will look into that – Joseph Wood Mar 18 '19 at 12:09
  • Thanks a lot @JosephWood for this extended answer! I already thought Rcpp sugar wouldn't get me far. I presume with the _true Mode_ you mean returning multiple modes if applicable... I just need one of them. In case you also know how to do the method dispatch, i.e. this to work with factors, character and logical vectors, that would be super, as that's really what I spend a lot of time trying to do without succeeding.. – Sebastian Mar 18 '19 at 12:19
  • 1
    Thanks also @F.Privé for this suggestion, I tried to code it:`#include using namespace Rcpp; // [[Rcpp::plugins(cpp11)]] // [[Rcpp::depends(BH)]] #include #include // [[Rcpp::export]] int fastIntMode2(IntegerVector x, bool narm = false) { if (narm) x = x[!is_na(x)]; int myMax = 1; int myMode = x[0]; boost::unordered_map modeMap; // etc ...` But it was not faster: about 5% slower than the code above. – Sebastian Mar 18 '19 at 12:24
  • Thanks for the feedback. – F. Privé Mar 18 '19 at 17:55
1

To follow up with some shameless self-promotion, I have now published a package collapse on CRAN which includes a full set of Fast Statistical Functions, amonst them the generic function fmode. The implementation is based on index hashing and even faster than the solution above. fmode can be used to perform simple, grouped and/or weighted mode calculations on vectors, matrices, data.frames and dplyr grouped tibbles. Syntax:

fmode(x, g = NULL, w = NULL, ...)

where x is a vector, matrix, data.frame or grouped_df, g is a grouping vector or list of grouping vectors, and w is a vector of weights. A compact solution to categorical and mixed aggregation problems is further provided by the function collap. The code

collap(data, ~ id1 + id2, FUN = fmean, catFUN = fmode)

aggregates the mixed type data.frame data applying fmean to numeric and fmode to categorical columns. More customized calls are also possible. Together with the Fast Statistical Functions, collap is just as fast as data.table on large numeric data, and categorical and weighted aggregations are significantly faster than anything that can presently be done with data.table.

Sebastian
  • 1,067
  • 7
  • 12