You can do a lot better (rmse ~ 0.04, $R^2$ > 0.99) by training the individual trees on small samples, or "bites" as Breiman called them.
Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine-learning terms this calls for increased regularization; for an ensemble learner it means trading strength for diversity.
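Everything below assumes the objects from the original question (x, y, noise, and a data frame mat with response Y and one predictor). My best guess at that setup, with the exact x grid and the column name X being assumptions on my part, is something like:

library(randomForest)

set.seed(1)                              # for reproducibility (not in the original)
x     <- seq(-5, 5, length.out = 1001)   # assumed: some 1001-point grid
noise <- rnorm(1001)
y     <- sin(x) + noise/4                # noisy sine, restated from the question
mat   <- data.frame(X = x, Y = y)        # column name X is my choice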
The diversity of a random forest can be increased by reducing the number of candidate features per split (mtry in R) or the training set of each tree (sampsize in R). Since there is only one input dimension, mtry does not help, leaving sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction.
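To see the strength-for-diversity trade-off directly, you can sweep sampsize yourself; a rough sketch (the particular sampsize values are arbitrary choices of mine, and the error is measured against the training grid as in the results below):

# Hedged sketch: training-set rmse (sd of residuals, as below) for a few
# sampsize values; smaller "bites" give more diverse trees and smoother fits.
for (s in c(1001, 250, 60)) {
  fit <- randomForest(Y ~ ., data = mat, sampsize = s, nodesize = 2,
                      replace = FALSE, ntree = 5000)
  cat("sampsize =", s, " rmse =", sd(predict(fit, mat) - sin(x)), "\n")
}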
small bags, more trees :: rmse = 0.04:
> sd(predict(randomForest(Y~., data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000),
             mat)
     - sin(x))
[1] 0.03912643
default settings :: rmse = 0.14:
> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018
error due to noise in the training set :: rmse = 0.25:
> sd(y - sin(x))
[1] 0.2548882
The error due to noise is of course evident from how the data were generated:
noise<-rnorm(1001)
y<-sin(x)+noise/4
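A quick sanity check on that figure (the added noise has standard deviation 1/4 by construction):

sd(noise)/4    # should come out near 0.25 for 1001 standard-normal draws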
In the above, the evaluation is being done against the training set, as it is in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out-of-bag evaluation shows similar accuracy:
> sd(predict(randomForest(Y~., data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000))
     - sin(x))
[1] 0.04059679