
I am dealing with a heavily imbalanced response variable, so my supervisor has recommended that I use SMOTE to upsample the minority observations in my data set. The data consists of many categorical predictors and, as I understand it, themis::step_smote from the tidymodels ecosystem only accepts numerical features so far.

I am aware that I can convert my factors and strings to numerical dummies by using recipes::step_dummy, but I am worried that the synthetic observations will contain values for these dummies that do not make any logical sense (values between 0 and 1, where logically only 0 and 1 are possible).

Is this a legitimate concern at all or can I proceed with using SMOTE on categorical dummies?
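
To make the concern concrete, here is a minimal base R sketch of the interpolation idea behind SMOTE as I understand it: a synthetic point is placed at a random position on the line segment between a minority observation and one of its minority-class neighbours. This is not the themis implementation; the column names and values are made up purely for illustration.

```r
# Two hypothetical minority-class rows after dummy coding a three-level factor
# (colour_green and colour_red are made-up dummy columns; blue is the reference level).
x_i  <- c(age = 30, colour_green = 1, colour_red = 0)
x_nn <- c(age = 40, colour_green = 0, colour_red = 1)

# SMOTE-style interpolation: pick a random point on the segment between the two rows.
set.seed(1)
gap <- runif(1)                      # random number strictly between 0 and 1
synthetic <- x_i + gap * (x_nn - x_i)

# The dummy columns of the synthetic row are no longer restricted to 0 or 1.
print(synthetic)
```

Whenever the two neighbours differ in a dummy column, the synthetic row lands strictly between 0 and 1 in that column, which is exactly the kind of value I am asking about.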

O René
  • 2
    I do believe that is a result you may end up with, and depending on the model you use and the exact imbalance, it could be a problem. That said, I have not seen such a situation be a problem for results in a predictive context. If you do run into problems, one option you might consider is using the ROSE algorithm instead. This is [also available in themis and no longer requires all numeric features](https://github.com/tidymodels/themis/blob/master/NEWS.md) in the development version. – Julia Silge Oct 15 '21 at 23:29
  • Thank you very much for your answer. I will give ROSE a try and see how it turns out. – O René Oct 18 '21 at 12:54

1 Answer


You should call `step_smote()` on categorical data; here is an example. Notice that `step_dummy()` is called for all nominal variables except for `died`.

Mark Rieke
  • I am aware of this blog post, but it doesn't answer my question of whether calling SMOTE on dummy variables is methodologically sound. – O René Oct 12 '21 at 16:18
  • [Here](https://themis.tidymodels.org/reference/smote.html) is the reference page for the `smote()` function that is set up by `step_smote()`. The class you want to upsample should be a factor and all others should be numeric (i.e., call `step_dummy()` before calling `step_smote()`). You don't want to use `step_smote()` on something that has been changed to a dummy variable! – Mark Rieke Oct 12 '21 at 16:39
  • I think you misunderstand my question. I know how to apply the recipe steps, that's not the problem. What I am wondering is whether it is problematic to use SMOTE if some features have been turned into dummy variables beforehand, because it is likely to produce decimal values for those dummies that do not make any logical sense. This is all about features, not the factor response. – O René Oct 13 '21 at 09:51
  • If you're using a binary classifier, it may be fine, but for multiclass classification it's definitely not. Smote will upsample minority classes called out in a column, but when you create dummy variables, a new column is created for each class (with a value of 1 or 0). Regardless, for the tidymodels implementation, the `themis` package is intended to be used with a factor column. I definitely recommend using the factor approach to avoid any unintentional consequences! – Mark Rieke Oct 13 '21 at 15:29
  • I'm not talking about multiclass classification, this is all about categorical features. But it's alright, I'll figure this out on my own. Thanks for trying, even if we talked way past each other. – O René Oct 14 '21 at 09:19
  • SMOTE is an upsampling method, so it'll add entire rows of data to try to balance the classes. You'll never end up with decimal values for features by calling SMOTE, but if you've already created dummy variables and then call SMOTE, you won't balance the classes correctly. If you have 3 categories to balance, SMOTE will resample until each class makes up ~1/3 of the dataset. If you call `step_dummy()` before SMOTE, it'll resample to try to balance all the dummy 1s and 0s to be ~1/6 of the dataset, which doesn't make any sense. I hope this helps, and apologies for any confusion I may have caused earlier! – Mark Rieke Oct 14 '21 at 18:06
  • You keep talking about the response, which is not what this post is about. It's about the features only: categorical features that must be converted to dummies for SMOTE to work. BUT the newly created, synthetic observations will contain decimal values in the dummy PREDICTORS. This is what I'm worried about, not the response. – O René Oct 15 '21 at 12:16
  • 1
    SMOTE won't create decimal values, but new observations. Here's a snippet you can run on the predictor `cut` from the `diamonds` dataset:

    ```
    library(tidymodels)  # attaches recipes, dplyr and ggplot2 (which provides diamonds)
    library(themis)      # provides step_smote()

    # diamonds dataset - there are far fewer fair/good diamonds
    diamonds %>% count(cut)

    # diamonds with smote evens out the classes in the predictor 'cut'
    recipe(price ~ ., data = diamonds) %>%
      step_dummy(all_nominal_predictors(), -cut) %>%
      step_smote(cut) %>%
      prep() %>%
      bake(new_data = NULL) %>%
      count(cut)
    ```

    You can use SMOTE on predictors! Just make sure they're not dummy variables. I'll leave it there - hopefully this helps! – Mark Rieke Oct 15 '21 at 14:34