
I have found a couple of explanations of what the out-of-bag error is, including one on Stack Overflow: What is out of bag error in random forests

However, I could not find any formula for how to calculate it exactly. Take the MATLAB help files as an example: err = oobError(B) computes the misclassification probability [...]. Here B is the model of bagged trees generated with the TreeBagger class.

What is the misclassification probability? Is it simply the accuracy of the out-of-bag data?

Accuracy = (TP + TN) / (P + N)

So simply the ratio of correctly classified instances to all instances present in the set?

If this is correct, then on the one hand I see the benefit of calculating it, as it is quite simple when you have some datasets to test on anyway, which the out-of-bag datasets are.

But on the other hand, accuracy is known to be a poor metric for imbalanced datasets. So my second question is: can the out-of-bag error cope with imbalanced datasets, and if not, is it even meaningful to report it in such cases?
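To make the concern concrete with made-up numbers: on a set of 950 negatives and 50 positives, a classifier that always predicts the negative class gets

    Accuracy = (TP + TN) / (P + N) = (0 + 950) / (50 + 950) = 0.95

i.e. 95% accuracy while never identifying a single positive instance.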

muuh

1 Answer


Out-of-bag error is simply the error computed on samples not seen during training. It plays an important role in bagging methods: due to the bootstrapping of the training set (building a new set by drawing at random with replacement), you actually get quite a chunk of the training data that is not used (in the limit about 37%, since each sample is left out of a given bootstrap with probability (1 - 1/n)^n → 1/e ≈ 0.368). If you have many such models (as in a random forest, where you have many trees, each trained on its own bootstrap sample), then you can average over these errors and get an estimate of the generalization error.
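A minimal sketch of that ~37% figure, assuming NumPy (the sample size n = 10000 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000  # arbitrary training-set size

    # One bootstrap sample: n indices drawn with replacement.
    boot = rng.integers(0, n, size=n)

    # Fraction of the original points that never got drawn (the OOB set).
    oob_fraction = 1 - len(np.unique(boot)) / n
    print(oob_fraction)  # close to 1/e ~ 0.368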

What is the misclassification probability? Is it simply the accuracy of the out-of-bag data?

Misclassification probability is 1 - Accuracy.
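Assuming scikit-learn (discussed at the end of this answer), this is just the complement of the stored OOB score; a minimal sketch, where the synthetic dataset and hyperparameters are arbitrary choices:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    # oob_score=True makes the forest track its out-of-bag accuracy.
    clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                 random_state=0).fit(X, y)

    # Misclassification probability = 1 - accuracy, here on OOB samples;
    # roughly the scikit-learn analogue of MATLAB's err = oobError(B).
    oob_error = 1 - clf.oob_score_
    print(oob_error)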

If this is correct, then on the one hand I see the benefit of calculating it, as it is quite simple when you have some datasets to test on anyway, which the out-of-bag datasets are.

Because using one test set approximates only the quality of the current model (whatever it is), while the out-of-bag estimate is a kind of estimate of a single element of your ensemble (a tree, in the case of a random forest) averaged over all possible selections of the training set. This is a different probabilistic measure; see for example Chapter 7 of Hastie, Tibshirani and Friedman's The Elements of Statistical Learning. Furthermore, its strength is that you do not waste any points: keeping a separate test set requires a considerable number of points so that you can still get a reasonable estimator (model) on the remaining data. The out-of-bag estimate lets you say something about how well the model behaves while at the same time using all the available data.
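A small sketch of the "no points wasted" argument, again assuming scikit-learn with made-up data (the 50/50 split is deliberately wasteful to make the contrast visible):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=0)

    clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                 random_state=0).fit(X_tr, y_tr)

    # OOB error: estimated from the training data alone.
    print(1 - clf.oob_score_)
    # Test error: needed half the data to be held out.
    print(1 - clf.score(X_te, y_te))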

But on the other hand, accuracy is known to be a poor metric for imbalanced datasets. So my second question is: can the out-of-bag error cope with imbalanced datasets, and if not, is it even meaningful to report it in such cases?

Out-of-bag error has nothing to do with accuracy as such. It is implemented in scikit-learn with accuracy, but it is defined over any loss function (classification metric). You can do the exact analogue with MCC, F1, or anything you want.
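For example, assuming scikit-learn, a fitted forest exposes its averaged out-of-bag votes as oob_decision_function_, so any metric can be computed from them; a sketch on a made-up imbalanced problem:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, matthews_corrcoef

    # Made-up 90/10 imbalanced problem.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                               random_state=0)

    clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                 random_state=0).fit(X, y)

    # oob_decision_function_ holds the averaged OOB class probabilities;
    # turn them into OOB predictions and score with any metric you like.
    oob_pred = clf.classes_[np.argmax(clf.oob_decision_function_, axis=1)]
    print(f1_score(y, oob_pred))
    print(matthews_corrcoef(y, oob_pred))

With enough trees every sample is out-of-bag for at least a few of them, so the OOB predictions cover the whole training set.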

lejlot
  • Thanks for the answer so far – it makes perfect sense that error = 1 - accuracy. But then I don't get your last point, "out-of-bag error has nothing to do with accuracy". Obviously the equation is based on accuracy. And I still don't understand whether the OOB error is usable with imbalanced classes. – muuh Nov 17 '15 at 13:05
  • Out-of-bag error is "error on out-of-bag samples"; the definition of the error itself is **arbitrary**. The scikit-learn developers decided to implement accuracy only, but the OOB error is a **theoretical object**, not a scikit-learn one. So it (as an object) has nothing to do with accuracy. scikit-learn's implementation does, and, as you said, that is useless if you need any other metric (as in an imbalanced scenario). – lejlot Nov 17 '15 at 13:19