
I am trying to understand exactly where SMOTE should be applied when training a model with cross-validation. I understand that all pre-processing steps should be carried out within each fold of cross-validation. Does that mean the following two setups are equivalent and theoretically correct?

SETUP 1: Pre-process with recipes, apply SMOTE via trainControl

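library(caret)    # train(), trainControl(), twoClassSummary()
library(recipes)  # recipe() and the step_*() functions

set.seed(888, sample.kind = "Rounding")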
tr_ctrl <- trainControl(summaryFunction = twoClassSummary, 
                        verboseIter = TRUE, 
                        savePredictions =  TRUE, 
                        sampling = "smote", 
                        method = "repeatedcv", 
                        number= 2, 
                        repeats = 0,
                        classProbs = TRUE, 
                        allowParallel = TRUE)
cw_smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%               
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~starts_with("State"):starts_with("wave")) %>%
  step_interact(~starts_with("husb_drink"):starts_with("husb_legal"))
  

cw_logit1 <- train(cw_smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = "binomial",
                   metric = "ROC",
                   trControl = tr_ctrl)

SETUP 2: Pre-process AND apply SMOTE with recipes. Does this apply SMOTE within each CV fold?

library(themis)  # provides step_smote()

set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary, 
                        verboseIter = TRUE, 
                        savePredictions =  TRUE, 
                        #sampling = "smote", ## NO LONGER WITHIN TRAINCONTROL
                        method = "repeatedcv", 
                        number= 2, 
                        repeats = 0,
                        classProbs = TRUE, 
                        allowParallel = TRUE)
smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%               
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~starts_with("State"):starts_with("wave")) %>%
  step_interact(~starts_with("husb_drink"):starts_with("husb_legal")) %>%
  step_smote(husb_beat) ## NEW STEP TO RECIPE

  

cw_logit2 <- train(smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = "binomial",
                   metric = "ROC",
                   trControl = tr_ctrl)

TIA!

  • Both of these setups should be identical. Do they produce the same models? – missuse Jan 25 '21 at 15:24
  • Hello, I have not been able to confirm that, since my code takes hours to run. I will check and report back. – datadocsharma Jan 25 '21 at 16:56
  • Take a smaller data set like iris or Sonar and use a fast algorithm without tuning hyperparameters, and you should be able to run it in a few seconds. – missuse Jan 25 '21 at 16:58
  • As far as I know, there is no data leakage during resampling in caret, whether you use recipes or define preprocessing/sampling via `train`/`trainControl`. – missuse Jan 25 '21 at 17:01
  • Hello, I do not find these two approaches to be similar on my data, so I did some further reading. Chapter 11 of https://topepo.github.io/caret/subsampling-for-class-imbalances.html seems to imply (and this is where I need feedback to make sure my interpretation is correct) that SMOTE within trainControl is applied to each and every resample, while anything that is not within trainControl happens outside of resampling. Please let me know if my interpretation is correct! – datadocsharma Feb 05 '21 at 22:13
  • It is very late to add my 2 cents, but I am commenting just to confirm your intuition. As stated in chapter 11 of your source, specifying the subsampling method inside `trainControl` forces `train` to conduct subsampling inside of resampling. – Elia Apr 21 '21 at 10:15
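
To make "subsampling inside of resampling" from the comments above concrete, here is a minimal base-R sketch of a 2-fold CV in which the minority class is oversampled only in the training part of each fold, while the holdout rows are left untouched. Several things here are assumptions rather than anything from the question: the data are simulated, `oversample()` is a hypothetical helper, and plain random oversampling stands in for SMOTE so that no extra packages are needed. It is not how caret implements `sampling = "smote"` internally, only an illustration of the ordering of the steps.

```r
# Base-R sketch: oversample INSIDE each fold of a 2-fold CV.
# Assumptions: simulated data; random oversampling as a stand-in for SMOTE.
set.seed(888)

n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
# Imbalanced outcome: "yes" is the rare class
y  <- factor(ifelse(runif(n) < plogis(-2.5 + 1.5 * x1), "yes", "no"),
             levels = c("no", "yes"))
dat <- data.frame(x1 = x1, x2 = x2, y = y)

# Hypothetical helper: duplicate minority rows until both classes are equal
oversample <- function(d) {
  counts   <- table(d$y)
  minority <- names(counts)[which.min(counts)]
  need     <- max(counts) - min(counts)
  idx      <- which(d$y == minority)
  extra    <- idx[sample.int(length(idx), need, replace = TRUE)]
  rbind(d, d[extra, ])
}

# Assign every row to one of two folds
folds <- sample(rep(1:2, length.out = n))

cv_summary <- do.call(rbind, lapply(1:2, function(k) {
  train_k   <- dat[folds != k, ]
  holdout_k <- dat[folds == k, ]       # never resampled
  train_bal <- oversample(train_k)     # sees the training rows only
  fit  <- glm(y ~ x1 + x2, data = train_bal, family = binomial)
  prob <- predict(fit, newdata = holdout_k, type = "response")
  pred <- factor(ifelse(prob > 0.5, "yes", "no"), levels = c("no", "yes"))
  data.frame(fold         = k,
             train_no     = sum(train_bal$y == "no"),
             train_yes    = sum(train_bal$y == "yes"),
             holdout_rows = nrow(holdout_k),
             holdout_yes  = sum(holdout_k$y == "yes"),
             sensitivity  = mean(pred[holdout_k$y == "yes"] == "yes"))
}))

print(cv_summary)
```

The summary shows balanced class counts in each training set and the original class mix in each holdout set. Oversampling the full data set before splitting would instead let copies of the same row land on both sides of the split, which is the leakage the comments refer to.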

0 Answers