
I'm having some difficulty understanding when upsampling should be used in specifying the training dataset, whether in tidymodels or otherwise.

For example, suppose you were building a classification model that would predict whether baseball players got a hit (HIT) or not (NOHIT). In a dataset of 10,000 at-bats, approximately 2,700 to 3,000 of the target values would be HIT and the remainder would be NOHIT (that's baseball).

This is an imbalanced dataset; however, the underlying process itself happens to be imbalanced. That being the case, should up-sampling be applied to the target variable of our classification model, or would doing so produce erroneous results?

Mutuelinvestor
  • You can read a bit more in detail [in this post](https://www.tidymodels.org/learn/models/sub-sampling/); subsampling like up- or down-sampling typically creates models that are better _calibrated_ and more able to learn both the positive and negative classes. – Julia Silge Apr 08 '21 at 21:49
  • Test it out! Run the model on the full data set first to get a baseline and check its accuracy/performance (AUC/ROC, confusion matrix, etc.). Then up-sample the minority class and compare the results/performance. Then down-sample the majority class and compare the performance. Then try SMOTE on the minority class and compare the results. Etc. – chitown88 Apr 09 '21 at 10:18
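
To make the comparison described in the comments concrete, here is a rough sketch using tidymodels with the themis package (which provides `step_upsample()`, `step_downsample()`, and `step_smote()`). The data frame `at_bats`, its `outcome` factor with levels HIT/NOHIT, and the logistic-regression model are hypothetical placeholders for illustration; substitute your own data and model.

```r
library(tidymodels)
library(themis)  # step_upsample(), step_downsample(), step_smote()

# `at_bats` is a hypothetical data frame: an `outcome` factor (HIT / NOHIT)
# plus predictor columns. Replace with your real data.
set.seed(123)
split <- initial_split(at_bats, strata = outcome)
train <- training(split)
test  <- testing(split)

base_rec <- recipe(outcome ~ ., data = train)

# Identical recipes except for the subsampling step. These steps are skipped
# when the recipe is applied to new data, so only the training set is resampled.
up_rec    <- base_rec %>% step_upsample(outcome)
down_rec  <- base_rec %>% step_downsample(outcome)
smote_rec <- base_rec %>% step_smote(outcome)  # SMOTE needs all-numeric predictors

mod <- logistic_reg() %>% set_engine("glm")

fit_and_eval <- function(rec) {
  wf     <- workflow() %>% add_recipe(rec) %>% add_model(mod)
  fitted <- fit(wf, data = train)
  preds  <- predict(fitted, test, type = "prob") %>%
    bind_cols(predict(fitted, test)) %>%
    bind_cols(test %>% select(outcome))
  # Compare on ROC AUC and the confusion matrix, as the comments suggest.
  # `.pred_HIT` assumes HIT is the first factor level of `outcome`.
  list(
    roc_auc  = roc_auc(preds, truth = outcome, .pred_HIT),
    conf_mat = conf_mat(preds, truth = outcome, estimate = .pred_class)
  )
}

results <- list(
  baseline    = fit_and_eval(base_rec),
  upsampled   = fit_and_eval(up_rec),
  downsampled = fit_and_eval(down_rec),
  smote       = fit_and_eval(smote_rec)
)
```

Comparing the entries of `results` shows whether subsampling actually changes the metrics you care about for this (only mildly imbalanced) problem, or whether the baseline model already learns both classes adequately.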

0 Answers