8

I am using a random forest. My test accuracy is 70%, but my train accuracy is only 34%. What should I do? How can I solve this problem?

3 Answers

17

Test accuracy should not be higher than train accuracy, since the model is optimized for the latter. Ways in which this behavior might happen:

  • you did not draw the test set from the same source dataset. You should do a proper train/test split in which both sets have the same underlying distribution. Most likely you provided a completely different (and easier) dataset for testing

  • an unreasonably high degree of regularization was applied. Even then, some element of "the test data distribution is not the same as that of train" would still be needed for the observed behavior to occur.
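To illustrate the first bullet, here is a minimal sketch of a proper train/test split with scikit-learn (the dataset and parameters here are illustrative, not from the question). Using `stratify` keeps the class proportions identical in both splits, so train and test share the same underlying distribution, and with a standard random forest the train accuracy then comes out at or above the test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# stratify=y keeps class proportions identical in both splits,
# so train and test come from the same underlying distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train={train_acc:.3f} test={test_acc:.3f}")
```

If instead the train accuracy comes out far below test, as in the question, the first thing to audit is whether the two sets really came from the same source.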

WestCoastProjects
    I agree with @javadba and would like to add: another reason could be data contamination, where records from the train set also exist in the test set. – Eran Yogev Apr 27 '20 at 13:26
  • I disagree with this; model metrics would be more representative of real-world performance if test data is more accurate than train. The downside is that the model performance would be better if the train data was more accurate. – Ethereal Jul 14 '22 at 19:52
  • I disagree with the statement that it *should* not be higher than train. In OP's case, the gap is large enough that it should not be. However, in some cases (such as mine), an adversarially trained model can be expected to have a lower training accuracy than a benign test set. – AndW Jul 16 '22 at 02:53
  • @a6623 Thanks for that clarification ie my statement does not hold in all cases. In particular `adversarial` models are a different animal – WestCoastProjects Jul 16 '22 at 15:03
3

The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!

If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting the harder patterns, so you'll need to increase the expressive capacity of your model. In the case of random forests, this might mean growing the trees to a greater depth.
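As a quick sketch of that last point (synthetic data, illustrative parameters): capping `max_depth` limits what each tree can express, while letting the trees grow fully lets the forest fit the training data much more closely.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with some harder, interacting features
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=0)

# Shallow trees: limited expressive capacity, may underfit
shallow = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y)

# max_depth=None (the default) grows each tree until its leaves are pure
deep = RandomForestClassifier(max_depth=None, random_state=0).fit(X, y)

print("shallow train accuracy:", shallow.score(X, y))
print("deep train accuracy:   ", deep.score(X, y))
```

The deep forest's training accuracy should be at least as high as the shallow one's; if your training accuracy is stuck low, relaxing the depth cap is one lever to try.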

AndW
1

First, check the data used for training. There may be a problem with the data; for example, it may not be properly pre-processed.

Also, in this case, try training for more epochs. Plot the learning curve to see when the model converges.

You should check the following:

  1. Both training and validation accuracy should increase, and loss should decrease.
  2. If something goes wrong in step 1 after a particular epoch, train your model only up to that epoch, because your model is over-fitting beyond it.
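Random forests are not trained in epochs, but the learning-curve idea above still applies: you can plot train and validation scores as a function of training-set size. A minimal sketch with scikit-learn's `learning_curve` (synthetic data, illustrative parameters):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data
X, y = make_classification(n_samples=600, n_features=20, random_state=1)

# Cross-validated train/validation scores at increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=1), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

for n, tr, va in zip(sizes,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```

If the validation score is still rising at the largest size, more data (or a more expressive model) may help; a train score that stays low at every size points to underfitting or a data problem.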
karel