
For an imbalanced classification problem, I am using an imblearn pipeline together with sklearn's GridSearchCV (to find the best hyper-parameters). The steps in the pipeline are as follows:

  1. Standardize each feature
  2. Correct for class imbalance by using ADASYN sampling
  3. Train Random Forest Classifier

A hyper-parameter search is run on the above pipeline using GridSearchCV (with stratified CV). The search space includes hyper-parameters of both ADASYN and the Random Forest.

While the above works great for choosing the best hyper-parameters on the train/validation splits, I think it would be erroneous to apply the same pipeline when predicting on the test data set.

The reason is that ADASYN sampling should not be applied when predicting on the test data set. The test data should be predicted as is, without any resampling. Therefore, the pipeline for prediction should be:

  1. Standardize each feature
  2. Predict using the trained Random Forest Classifier (no ADASYN sampling)

How can I use the sklearn/imblearn API to ignore a specific transform in the pipeline in this manner?

My code (expressing the same problem as above):

import pandas as pd
from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Get data
df = pd.read_csv('train.csv')
y_col = 'output'
x_cols = [c for c in df.columns if c != y_col]

# Train and Test data sets
train, test = train_test_split(df, shuffle=True, stratify=df[y_col])

# Define pipeline of transforms and model
pl = imbPipeline([('std', StandardScaler()),
                  ('sample', ADASYN()),
                  ('rf', RandomForestClassifier())])

# Additional code to define params for the grid search is omitted;
# params will contain hyper-parameters for both ADASYN and the random forest.
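# Purely illustrative grid (hypothetical values) -- the keys follow Pipeline's
# 'stepname__parameter' naming convention:
params = {'sample__n_neighbors': [3, 5],
          'rf__n_estimators': [100, 200],
          'rf__max_depth': [None, 10]}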

# grid search with explicit stratified CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv = GridSearchCV(pl, params, scoring='f1', cv=skf, n_jobs=-1)
cv.fit(train[x_cols], train[y_col])

# Now that the grid search has been done and the object cv contains the
# best hyper-parameters, I would like to test on test data set:

test_pred = cv.predict(test[x_cols])  # WRONG! No need to do ADASYN sampling!
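# (Note: with GridSearchCV's default refit=True, cv.best_params_ holds the
# selected hyper-parameters and cv.predict() uses the pipeline refit on the
# full training data with those parameters.)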
  • I wouldn't think it's necessary to do your ADASYN _after_ you standardize your features, in fact you may get results that fit better to the actual data if it's not scaled. If it were me, I'd do the oversample outside of the pipeline, then you can use the pipeline as normal – G. Anderson Oct 29 '18 at 15:02
  • 1
    Well, the folks at imblearn feel exactly the same, so its been taken care of. When you call `predict()` or `transform()` on imblearn pipeline, it skips the sampling part automatically. See my other answer [here for details](https://stackoverflow.com/a/50245954/3374996) (and the answer linked in that answer). – Vivek Kumar Oct 30 '18 at 06:19
  • Thank you, @VivekKumar! In hindsight, this should have been obvious to me, since test_pred.shape has the same number of rows as test[x_cols].shape. :) – Chaos Oct 30 '18 at 09:14
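
Following up on the comment above, here is a minimal sketch (using hypothetical toy data) showing that an imblearn pipeline applies the sampler only during fit, so predict returns one prediction per input row:

import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data; labels are independent of X, so every minority sample
# has majority-class neighbours and ADASYN can always generate new samples.
rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = (rng.rand(300) < 0.25).astype(int)

pipe = Pipeline([('sample', ADASYN()),
                 ('rf', RandomForestClassifier(n_estimators=50))])
pipe.fit(X, y)                        # ADASYN resamples here, during fit only
pred = pipe.predict(X)                # no resampling at predict time
assert pred.shape[0] == X.shape[0]    # one prediction per input row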
