
I'm relatively new to Python. Can you help me turn my SMOTE implementation into a proper pipeline? What I want is to apply the over- and under-sampling to the training set of every k-fold iteration, so that the model is trained on a balanced data set and evaluated on the imbalanced left-out fold. The problem is that when I do that, I can't use the familiar sklearn interface for evaluation and grid search.

Is it possible to make something similar to model_selection.RandomizedSearchCV? My take on this:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE

df = pd.read_csv("Imbalanced_data.csv")  # load the data set
X = df.iloc[:, 0:64].values
y = df.iloc[:, 64].values

n_splits = 2
n_measures = 2  # recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # stratified so every fold keeps the class proportions
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
scores = np.zeros((n_splits, n_measures))

for i, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # resample only the training fold; the test fold stays imbalanced
    sm = SMOTE(k_neighbors=5)  # older imbalanced-learn versions: SMOTE(ratio='auto', ...)
    smote_enn = SMOTEENN(smote=sm)
    X_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)  # fit_sample() in older versions
    clf_rf.fit(X_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)  # predict() takes X only
    scores[i, 0] = recall_score(y_test, y_pred)  # index by fold number, not test_index
    scores[i, 1] = roc_auc_score(y_test, y_pred)  # auc() expects curve coordinates, not labels
MLearner
  • Have you found a solution to your problem? – Vivek Kumar Feb 01 '18 at 04:22
  • Yes, actually your comment helped me tremendously. Thank you very much! – MLearner Feb 03 '18 at 09:46
  • Hi @VivekKumar, does this method ensure that, when running k-fold CV, the validation set will not include over-sampled observations? I am trying to find a way so that, after I do a train/test split and then oversample the training set, the validation set of each CV fold drawn from the training set does not carry the bias from the oversampling. Thanks! – thePurplePython May 21 '19 at 04:17
  • @thePurplePython Yes, you are correct. The `imblearn` pipeline will only call the `sample()` method on the training data, not on the test data. The test data is passed through without any changes. – Vivek Kumar May 21 '19 at 05:48
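
To make the behaviour described in the comments concrete, here is a minimal sketch on toy data (assuming a recent imbalanced-learn): the resampler runs only on the training folds inside cross-validation, and each validation fold keeps its original class distribution.

import numpy as np
from imblearn.combine import SMOTEENN
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# toy imbalanced data, purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.array([0] * 180 + [1] * 20)

pipeline = make_pipeline(SMOTEENN(), RandomForestClassifier(random_state=1))

# SMOTEENN resamples each training fold only; every validation fold
# is evaluated with its original, imbalanced distribution
scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(n_splits=5), scoring='recall')
print(scores)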

2 Answers


You need to look at the Pipeline object. imbalanced-learn has a Pipeline which extends the scikit-learn Pipeline to support the fit_sample() and sample() methods, in addition to scikit-learn's fit_predict(), fit_transform() and predict() methods.

For your code, you would want to do this:

from imblearn.pipeline import make_pipeline, Pipeline

smote_enn = SMOTEENN(smote=sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)

pipeline = make_pipeline(smote_enn, clf_rf)

# or, equivalently, with explicit step names:
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])

Then you can pass this pipeline object to GridSearchCV, RandomizedSearchCV or other cross-validation tools in scikit-learn just like a regular estimator.

kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)
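
Note that param_dist is not shown above; when tuning a pipeline, every parameter name has to be prefixed with the name of the step it belongs to. A hypothetical search space for the explicitly named Pipeline variant (the parameter names and values here are illustrative, not from the original answer):

# hypothetical search space; parameters are addressed as <step name>__<parameter>
param_dist = {
    'clf_rf__n_estimators': [25, 50, 100],
    'clf_rf__max_depth': [None, 5, 10],
    'smote_enn__smote__k_neighbors': [3, 5, 7],  # reaches the SMOTE nested inside SMOTEENN
}

random_search.fit(X, y)  # resampling happens inside each training fold only
print(random_search.best_params_)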
Vivek Kumar

This looks like it would fit the bill http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html

You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that calls into SMOTE when fit on the training data, returning a balanced data set (presumably the one gotten from StratifiedKFold), but leaves the test data untouched at predict time.
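
One caveat on the above: a plain TransformerMixin cannot change y (or the number of samples) in transform(), which is exactly what oversampling needs to do. The same idea is easier to realize with imblearn's FunctionSampler, which wraps an arbitrary resampling function and is applied only at fit time. A minimal sketch, assuming a recent imbalanced-learn:

from imblearn import FunctionSampler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

def resample(X, y):
    # custom resampling logic; the pipeline calls this only when fitting,
    # i.e. only on the training folds during cross-validation
    return SMOTE(k_neighbors=5).fit_resample(X, y)

pipeline = make_pipeline(FunctionSampler(func=resample),
                         RandomForestClassifier(n_estimators=25, random_state=1))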

Matti Lyra