
I'm experimenting with R and the randomForest package; I have some experience with SVMs and neural nets. My first test is to regress sin(x) plus Gaussian noise. With neural nets and SVMs I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters). When doing the same with randomForest I get a completely overfitted solution. I simply use (R 2.14.0, tried on 2.14.1 too, just in case):

library("randomForest")
x<-seq(-3.14,3.14,by=0.00628)
noise<-rnorm(1001)
y<-sin(x)+noise/4
mat<-matrix(c(x,y),ncol=2,dimnames=list(NULL,c("X","Y")))
plot(x,predict(randomForest(Y~.,data=mat),mat),col="green")
points(x,y)

I guess there is a magic option in randomForest to make it work correctly; I tried a few but did not find the right lever to pull...

user1206729

3 Answers


You can use maxnodes to limit the size of the trees, as in the examples in the manual.

r <- randomForest(Y~.,data=mat, maxnodes=10)
plot(x,predict(r,mat),col="green")
points(x,y)
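
To quantify how much maxnodes helps, you can measure the error against the noise-free sin(x), the same check the sampsize answer below uses; a minimal sketch reusing the fit above:

# RMSE of the maxnodes=10 fit measured against the noiseless target
sd(predict(r, mat) - sin(x))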
Vincent Zoonekynd
  • That was one of the options I tried; it gives a slightly better result but it still seems very bad compared to SVM and NN... there must be a better set of options... – user1206729 Feb 14 '12 at 13:42
  • 2
    One of the interesting things about machine learning is that there is not a one-size-fits-all method. Certain types of algos are better for different types of data. Unfortunately I haven't found a source outlining which method is best for which data set and thus rely almost exclusively on trial and error. – screechOwl Apr 25 '12 at 15:45

You can do a lot better (RMSE ~ 0.04, $R^2$ > 0.99) by training individual trees on small samples, or "bites" as Breiman called them.

Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine-learning terms this calls for increasing regularization. For an ensemble learner this means trading strength for diversity.

Diversity of random forests can be increased by reducing the number of candidate features per split (mtry in R) or the training set of each tree (sampsize in R). Since there is only 1 input dimension, mtry does not help, leaving sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction.

small bags, more trees :: rmse = 0.04:

> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000),
             mat)
     - sin(x))
[1] 0.03912643

default settings :: rmse=0.14:

> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018

error due to noise in training set :: rmse = 0.25

> sd(y - sin(x))
[1] 0.2548882

The error due to noise is of course evident from

noise<-rnorm(1001)
y<-sin(x)+noise/4

In the above, the evaluation is done against the training set, as in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out-of-bag evaluation shows similar accuracy:

> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000))
     - sin(x))
[1] 0.04059679
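
To see how the strength/diversity trade-off plays out, you can sweep sampsize and look at the out-of-bag error; this is only a rough sketch (the grid of values is arbitrary and results will vary with the random seed):

# out-of-bag RMSE against sin(x) for a range of per-tree sample sizes
for (s in c(30, 60, 125, 250, 500)) {
  fit <- randomForest(Y ~ ., data = mat, sampsize = s, nodesize = 2,
                      replace = FALSE, ntree = 1000)
  cat("sampsize =", s, "  oob rmse =", sd(predict(fit) - sin(x)), "\n")
}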
Daniel Mahler

My intuition is that:

  • if you had a single decision tree to fit a 1-dimensional curve f(x), that would be equivalent to fitting a staircase function (not necessarily with equally spaced jumps)
  • with random forests you get a linear combination of staircase functions

For a staircase function to be a good approximator of f(x), you want enough steps on the x axis, but each step should contain enough points so that their mean is a good approximation of f(x) and less affected by noise.
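
To make the staircase picture concrete, you can fit a single tree and overlay its prediction on the data; a small sketch (a forest with ntree=1 stands in for a lone decision tree):

# a single tree's prediction is a piecewise-constant (staircase) function of x
single <- randomForest(Y ~ ., data = mat, ntree = 1, sampsize = nrow(mat),
                       replace = FALSE)
plot(x, y, col = "grey")
lines(x, predict(single, mat), col = "red", type = "s")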

So I suggest you tune the nodesize parameter. If you have 1 decision tree, N points, and nodesize=n, then your staircase function will have N/n steps. Too small an n leads to overfitting. I got nice results with n ~ 30 (RMSE ~ 0.07):

r <- randomForest(Y~.,data=mat, nodesize=30)
plot(x,predict(r,mat),col="green")
points(x,y)

Notice that RMSE gets smaller if you take N'=10*N and n'=10*n.
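
That scaling claim can be checked directly; a sketch assuming the same noise model as in the question (the factor of 10 is just an example):

# 10x more points with 10x larger nodesize keeps ~N/n steps,
# but each step now averages more points, so the fit to sin(x) improves
x2 <- seq(-3.14, 3.14, length.out = 10010)
y2 <- sin(x2) + rnorm(length(x2)) / 4
mat2 <- data.frame(X = x2, Y = y2)
r2 <- randomForest(Y ~ ., data = mat2, nodesize = 300)
sd(predict(r2, mat2) - sin(x2))   # compare with the nodesize=30 fit above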

fabiob