
I'm relatively new to Python. Can you help me turn my SMOTE implementation into a proper pipeline? What I want is to apply the over- and under-sampling to the training set of every k-fold iteration, so that the model is trained on a balanced data set and evaluated on the imbalanced left-out fold. The problem is that when I do that, I can't use the familiar sklearn interface for evaluation and grid search.

Is it possible to make something similar to model_selection.RandomizedSearchCV? My take on this:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE

df = pd.read_csv("Imbalanced_data.csv")  # load the data set
X = df.iloc[:, 0:64].values
y = df.iloc[:, 64].values

n_splits = 2
n_measures = 2  # recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # stratified so every fold keeps the class proportions
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
scores = np.zeros((n_splits, n_measures))

for i, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # resample only the training fold; the test fold stays imbalanced
    sm = SMOTE(k_neighbors=5)  # older imbalanced-learn versions: SMOTE(ratio='auto', ...)
    smote_enn = SMOTEENN(smote=sm)
    X_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)  # fit_sample() in older versions
    clf_rf.fit(X_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)  # predict() takes X only
    scores[i, 0] = recall_score(y_test, y_pred)  # index by fold number, not test_index
    scores[i, 1] = roc_auc_score(y_test, y_pred)  # auc() expects curve coordinates, not labels
MLearner
  • Have you found a solution to your problem? – Vivek Kumar Feb 01 '18 at 04:22
  • Yes, actually your comment helped me tremendously. Thank you very much! – MLearner Feb 03 '18 at 09:46
  • Hi @VivekKumar, does this method ensure that, when running k-fold CV, the validation set will not include over-sampled observations? I am trying to find a way so that, after I do a train/test split and then oversample the training set, the validation set of each CV fold drawn from the training set does not carry the bias from the oversampling. Thanks! – thePurplePython May 21 '19 at 04:17
  • @thePurplePython Yes, you are correct. The `imblearn` pipeline will only call the `sample()` method on the training data, not on the test data. The test data is passed through without any changes. – Vivek Kumar May 21 '19 at 05:48
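
To make the behaviour described in the comments concrete, here is a minimal sketch on toy data (assuming a recent imbalanced-learn): the resampler runs only on the training folds inside cross-validation, and each validation fold keeps its original class distribution.

import numpy as np
from imblearn.combine import SMOTEENN
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# toy imbalanced data, purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.array([0] * 180 + [1] * 20)

pipeline = make_pipeline(SMOTEENN(), RandomForestClassifier(random_state=1))

# SMOTEENN resamples each training fold only; every validation fold
# is evaluated with its original, imbalanced distribution
scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(n_splits=5), scoring='recall')
print(scores)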

2 Answers


You need to look at the Pipeline object. imbalanced-learn has a Pipeline which extends the scikit-learn Pipeline to support the fit_sample() and sample() methods, in addition to scikit-learn's fit_predict(), fit_transform() and predict() methods.

For your code, you would want to do this:

from imblearn.pipeline import make_pipeline, Pipeline

smote_enn = SMOTEENN(smote=sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)

pipeline = make_pipeline(smote_enn, clf_rf)

# or, equivalently, with explicit step names:
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])

Then you can pass this pipeline object to GridSearchCV, RandomizedSearchCV or other cross-validation tools in scikit-learn just like a regular estimator.

kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)
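
Note that param_dist is not shown above; when tuning a pipeline, every parameter name has to be prefixed with the name of the step it belongs to. A hypothetical search space for the explicitly named Pipeline variant (the parameter names and values here are illustrative, not from the original answer):

# hypothetical search space; parameters are addressed as <step name>__<parameter>
param_dist = {
    'clf_rf__n_estimators': [25, 50, 100],
    'clf_rf__max_depth': [None, 5, 10],
    'smote_enn__smote__k_neighbors': [3, 5, 7],  # reaches the SMOTE nested inside SMOTEENN
}

random_search.fit(X, y)  # resampling happens inside each training fold only
print(random_search.best_params_)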
Vivek Kumar

This looks like it would fit the bill http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html

You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that calls into SMOTE when fit on the training data, returning a balanced data set (presumably the one gotten from StratifiedKFold), but leaves the test data untouched at predict time.
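
One caveat on the above: a plain TransformerMixin cannot change y (or the number of samples) in transform(), which is exactly what oversampling needs to do. The same idea is easier to realize with imblearn's FunctionSampler, which wraps an arbitrary resampling function and is applied only at fit time. A minimal sketch, assuming a recent imbalanced-learn:

from imblearn import FunctionSampler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

def resample(X, y):
    # custom resampling logic; the pipeline calls this only when fitting,
    # i.e. only on the training folds during cross-validation
    return SMOTE(k_neighbors=5).fit_resample(X, y)

pipeline = make_pipeline(FunctionSampler(func=resample),
                         RandomForestClassifier(n_estimators=25, random_state=1))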

Matti Lyra