1

I would like to create a Pipeline with SMOTE() inside, but I can't figure out where to implement it. My target value is imbalanced. Without SMOTE I have very bad results.

My code:

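from collections import Counter

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

df_n = df[['user_id','signup_day', 'signup_month', 'signup_year',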
    'purchase_day', 'purchase_month', 'purchase_year','purchase_value',
    'source','browser','sex','age', 'is_fraud']]

# Define X and y:
X = df_n.drop(['is_fraud'], axis = 1)
y = df_n.is_fraud

# Split into a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

print(Counter(y_train)) #Counter({0: 95844, 1: 9934})

numeric_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='mean'))
      ,('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='constant'))
      ,('encoder', OrdinalEncoder())
])

numeric_features = ['user_id','signup_day', 'signup_month', 'signup_year',
        'purchase_day', 'purchase_month', 'purchase_year','purchase_value', 'age']

categorical_features = ['source', 'browser', 'sex']

preprocessor = ColumnTransformer(
   transformers=[
    ('numeric', numeric_transformer, numeric_features)
   ,('categorical', categorical_transformer, categorical_features)
]) 

regressors = [
    RandomForestRegressor()
   ,LogisticRegression()
   ,DecisionTreeClassifier()
   ,KNeighborsClassifier()
   ,LinearSVC(random_state=42)]

for regressor in regressors:
    pipeline = Pipeline(steps = [
               ('preprocessor', preprocessor)
              ,('regressor',regressor)
           ])
    model = pipeline.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(regressor)
    print(r2_score(y_test, predictions))

My results:

RandomForestRegressor()
0.48925960579049166
LogisticRegression()
0.24151543370722806
DecisionTreeClassifier()
-0.14622417739659155
KNeighborsClassifier()
0.3542030752350408
LinearSVC(random_state=42)
-0.10256098450762474
Flavia Giammarino
Anastasia_data
  • Just replace `from sklearn.pipeline import Pipeline` by `from imblearn.pipeline import Pipeline`, the version of `Pipeline` in `imblearn` allows `SMOTE` combined with the usual steps of scikit-learn – RafaelCaballero May 24 '23 at 10:40

3 Answers

3

You can use the code below to add SMOTE to a pipeline (it needs some tweaking, though):

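from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# define pipeline: oversample the minority class, then undersample the majority class
# (each sampling_strategy is the target minority/majority ratio and must be tuned to your data)
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)

# evaluate pipeline; the samplers run only when fitting on the training folds
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)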
Flavia Giammarino
Abhishek
2
from imblearn.over_sampling import SMOTEN
sampler = SMOTEN(random_state=0)
Xsm,ysm = sampler.fit_resample(X, y)
chrslg
1

Treat SMOTE separately, not inside the pipeline, by using this code:

What you can do is use a modification of the SMOTE algorithm, called SMOTE-N (see https://imbalanced-learn.org/dev/over_sampling.html#smote-variants), which works when all features are categorical. This modifies the SMOTE algorithm to