
I'm having some difficulty understanding when upsampling should be used in specifying the training dataset, whether in tidymodels or otherwise.

For example, suppose you were building a classification model that would predict whether baseball players got a hit (HIT) or not (NOHIT). In a dataset of 10,000 at-bats, approximately 2,700 to 3,000 of the target values would be HIT and the remainder would be NOHIT (that's baseball).

This is an imbalanced dataset; however, the underlying process itself happens to be imbalanced. That being the case, should up-sampling be applied to the target variable of our classification model, or would doing so produce erroneous results?

Mutuelinvestor
  • You can read a bit more in detail [in this post](https://www.tidymodels.org/learn/models/sub-sampling/); subsampling like up- or down-sampling typically creates models that are better _calibrated_ and more able to learn both the positive and negative classes. – Julia Silge Apr 08 '21 at 21:49
  • Test it out! Run the model on the full data set first to get a baseline and check its accuracy/performance (AUC/ROC, confusion matrix, etc.). Then up-sample the minority class and compare the results/performance. Then down-sample the majority class and compare the performance. Then try SMOTE on the minority class and compare the results. Etc. – chitown88 Apr 09 '21 at 10:18
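
To make the comparison described in the comments concrete, here is a rough sketch using tidymodels with the themis package (which provides `step_upsample()`, `step_downsample()`, and `step_smote()`). The data frame `at_bats`, its `outcome` factor with levels HIT/NOHIT, and the logistic-regression model are hypothetical placeholders for illustration; substitute your own data and model.

```r
library(tidymodels)
library(themis)  # step_upsample(), step_downsample(), step_smote()

# `at_bats` is a hypothetical data frame: an `outcome` factor (HIT / NOHIT)
# plus predictor columns. Replace with your real data.
set.seed(123)
split <- initial_split(at_bats, strata = outcome)
train <- training(split)
test  <- testing(split)

base_rec <- recipe(outcome ~ ., data = train)

# Identical recipes except for the subsampling step. These steps are skipped
# when the recipe is applied to new data, so only the training set is resampled.
up_rec    <- base_rec %>% step_upsample(outcome)
down_rec  <- base_rec %>% step_downsample(outcome)
smote_rec <- base_rec %>% step_smote(outcome)  # SMOTE needs all-numeric predictors

mod <- logistic_reg() %>% set_engine("glm")

fit_and_eval <- function(rec) {
  wf     <- workflow() %>% add_recipe(rec) %>% add_model(mod)
  fitted <- fit(wf, data = train)
  preds  <- predict(fitted, test, type = "prob") %>%
    bind_cols(predict(fitted, test)) %>%
    bind_cols(test %>% select(outcome))
  # Compare on ROC AUC and the confusion matrix, as the comments suggest.
  # `.pred_HIT` assumes HIT is the first factor level of `outcome`.
  list(
    roc_auc  = roc_auc(preds, truth = outcome, .pred_HIT),
    conf_mat = conf_mat(preds, truth = outcome, estimate = .pred_class)
  )
}

results <- list(
  baseline    = fit_and_eval(base_rec),
  upsampled   = fit_and_eval(up_rec),
  downsampled = fit_and_eval(down_rec),
  smote       = fit_and_eval(smote_rec)
)
```

Comparing the entries of `results` shows whether subsampling actually changes the metrics you care about for this (only mildly imbalanced) problem, or whether the baseline model already learns both classes adequately.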

0 Answers