
I am trying to replicate the method from Caruana et al.'s paper "Ensemble Selection from Libraries of Models". At the core of the method is a greedy algorithm that adds models to the ensemble one at a time (the same model can be added more than once). I've written an implementation of this greedy optimization algorithm, but it is very slow:

library(compiler)
set.seed(42)
X <- matrix(runif(100000*10), ncol=10)  # dummy data standing in for a library of model predictions
Y <- rnorm(100000)                      # target values

greedOpt <- cmpfun(function(X, Y, iter=100){
  weights <- rep(0, ncol(X))

  while(sum(weights) < iter) {

    # RMSE of the ensemble after adding one more copy of each candidate model
    errors <- sapply(1:ncol(X), function(j){
      newweights <- weights
      newweights[j] <- newweights[j] + 1
      pred <- X %*% newweights / sum(newweights)
      error <- Y - pred
      sqrt(mean(error^2))
    })

    # greedily add the model that lowers the ensemble RMSE the most
    update <- which.min(errors)
    weights[update] <- weights[update] + 1
  }
  return(weights/sum(weights))
})

system.time(a <- greedOpt(X,Y))

I know R doesn't do loops well, but I can't think of any way to do this type of stepwise search without a loop.

Any suggestions for improving this function?

Zach
  • Are `max.members` and `iter` interchangeable? I don't think it is fair or useful to compare your greedOpt function with lm.fit as they are doing very different things. `greedOpt` is using a loop to do something perhaps 1000 times, lm.fit doesn't. – mnel Feb 18 '13 at 22:54
  • @mnel Yes, I corrected my code. I agree, `lm.fit` isn't fair, but I feel like it's a good upper bound for the best performance I could get. – Zach Feb 18 '13 at 23:25
  • Were you successful in implementing Caruana's ensemble selection method? – Ben Rollert Dec 17 '16 at 04:17
  • @BenRollert See my caretEnsemble package. I ended up getting better results using a glm to find the weights. https://github.com/zachmayer/caretEnsemble – Zach Dec 18 '16 at 14:59
  • @Zach I've experienced much better performance using the ensemble selection method vs glm when working with a partner on Kaggle comp (we secured top-4 placement as a result of his approach). A glm, especially not regularized, can easily overfit with a lot of base models. – Ben Rollert Dec 20 '16 at 03:03
  • @BenRollert I meant a regularized glm. I've found that allowing coefficients to be negative helped a lot in kaggle competitions like the recent allstate claims severity competition. Regardless, if you wrap greedy forward weighting in the caret API, you can use it with caret or with caretEnsemble: https://topepo.github.io/caret/using-your-own-model-in-train.html – Zach Dec 20 '16 at 15:23

2 Answers


I took a shot at writing an Rcpp version of this function:

library(Rcpp)
cppFunction('
  NumericVector greedOptC(NumericMatrix X, NumericVector Y, int iter) {
    int nrow = X.nrow(), ncol = X.ncol();
    NumericVector weights(ncol);
    NumericVector newweights(ncol);
    NumericVector errors(nrow);
    double RMSE;
    double bestRMSE;
    int bestCol;

    for (int i = 0; i < iter; i++) {
      bestRMSE = -1;  // sentinel: no candidate evaluated yet
      bestCol = 0;

      // try adding one copy of each model to the ensemble
      for (int j = 0; j < ncol; j++) {
        newweights = weights + 0;  // "+ 0" forces a copy of weights
        newweights[j] = newweights[j] + 1;
        newweights = newweights/sum(newweights);

        // weighted-average prediction of the candidate ensemble
        NumericVector pred(nrow);
        for (int k = 0; k < ncol; k++){
          pred = pred + newweights[k] * X( _, k);
        }

        errors = Y - pred;
        RMSE = sqrt(mean(errors*errors));

        if (RMSE < bestRMSE || bestRMSE==-1){
          bestRMSE = RMSE;
          bestCol = j;
        }
      }

      // keep the best candidate
      weights[bestCol] = weights[bestCol] + 1;
    }

    weights = weights/sum(weights);
    return weights;
  }
')

It's more than twice as fast as the R version:

set.seed(42)
X <- matrix(runif(100000*10), ncol=10)
Y <- rnorm(100000)
> system.time(a <- greedOpt(X, Y, 1000))
   user  system elapsed 
  36.19    6.10   42.40 
> system.time(b <- greedOptC(X, Y, 1000))
   user  system elapsed 
  16.50    1.44   18.04
> all.equal(a,b)
[1] TRUE

Not bad, but I was hoping for a bigger speedup when making the leap from R to Rcpp. This is one of the first Rcpp functions I've ever written, so perhaps further optimization is possible.

Zach
  • Not too surprising, since most of your computation goes into that matrix multiplication, for which R already does a pretty good job. I think the biggest improvement you can get is by switching to an optimized BLAS such as ATLAS. – flodel Feb 19 '13 at 00:23
  • Other small untested improvements include: (a) replace that `sapply(...)` with `vapply(..., numeric(1L))`, (b) replace the division `/sum(newweights)` by a multiplication `* (1 / sum(newweights))`, (c) use integers instead of numerics: `weights <- rep(0L, ncol(X))`, `newweights[y] + 1L`, and `weights[update] + 1L`. – flodel Feb 19 '13 at 00:27
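Applied to the original R function, flodel's suggested tweaks might look something like this. It's an untested sketch: `greedOpt2` is just an illustrative name, and the gains are likely modest next to the matrix multiplication itself.

# Sketch of flodel's micro-optimizations (a)-(c); assumes library(compiler)
# is already loaded, as in the question.
greedOpt2 <- cmpfun(function(X, Y, iter = 100L){
  weights <- rep(0L, ncol(X))              # (c) integer weights

  while (sum(weights) < iter) {
    # (a) vapply with a declared return type instead of sapply
    errors <- vapply(seq_len(ncol(X)), function(j){
      newweights <- weights
      newweights[j] <- newweights[j] + 1L  # (c) integer increment
      # (b) multiply by the reciprocal instead of dividing
      pred <- X %*% newweights * (1 / sum(newweights))
      sqrt(mean((Y - pred)^2))
    }, numeric(1L))

    update <- which.min(errors)
    weights[update] <- weights[update] + 1L
  }
  weights / sum(weights)
})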

Here is an R implementation that is about 30% faster than yours. It's not as fast as your Rcpp version, but maybe it will give you ideas that, combined with Rcpp, will speed things up further. The two main improvements are:

  1. the sapply loop has been replaced by a matrix formulation
  2. the matrix multiplication has been replaced by a recursion

greedOpt <- cmpfun(function(X, Y, iter = 100L){

  N           <- ncol(X)
  weights     <- rep(0L, N)
  pred        <- 0 * X        # matrix of candidate ensemble predictions
  sum.weights <- 0L

  while(sum.weights < iter) {

      sum.weights   <- sum.weights + 1L
      # column j = ensemble average if one more copy of model j is added
      pred          <- (pred + X) * (1L / sum.weights)
      # sqrt of the sum of squared errors; same argmin as the RMSE
      errors        <- sqrt(colSums((pred - Y) ^ 2L))
      best          <- which.min(errors)
      weights[best] <- weights[best] + 1L
      # collapse pred back to the running (unnormalized) sum of chosen columns
      pred          <- pred[, best] * sum.weights
  }
  return(weights / sum.weights)
})
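
For a direct comparison with the timings above, something like the following should work (untested here; it assumes `X`, `Y`, and the Rcpp result `b` from the other answer are still in the workspace, and `d` is just an illustrative name):

system.time(d <- greedOpt(X, Y, 1000L))
all.equal(b, d)  # should be TRUE unless ties are broken differently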

Also, I maintain that you should try switching to the ATLAS library. You might see significant improvements.
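
If you're not sure which BLAS your R build is linked against, recent versions of R report it as part of `sessionInfo()`:

sessionInfo()  # in recent R versions, the output lists the BLAS/LAPACK in use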

flodel
  • That is very fast! On my laptop it clocks in at 23 seconds, while the Rcpp version I wrote is 18 seconds. I'll look into Atlas, thanks for the suggestion. – Zach Feb 19 '13 at 02:30