For the given imbalanced data, I have created separate pipelines for standardization and one-hot encoding:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('ohe', OneHotCategoricalEncoder())])
After that, I combine the above pipelines in a single column transformer:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
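To check that the column transformer produces what is expected, a minimal sketch on toy data could look like the following. The DataFrame, its column names, and the use of scikit-learn's `OneHotEncoder` as a stand-in for `OneHotCategoricalEncoder` are all assumptions made for illustration, not part of my actual setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30.0, 52.0, 81.0, 64.0],
    "city": ["a", "b", "a", "c"],
})
numeric_features = ["age", "income"]
categorical_features = ["city"]

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
# Assumption: sklearn's OneHotEncoder stands in for OneHotCategoricalEncoder.
categorical_transformer = Pipeline(
    steps=[("ohe", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Two scaled numeric columns plus one column per distinct city value.
transformed = preprocessor.fit_transform(df)
print(transformed.shape)
```

Each entry in `transformers` is a `(name, transformer, columns)` tuple, so the numeric and categorical branches are fitted independently and their outputs are concatenated column-wise.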
The final pipeline is as follows:
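# pl1 is presumably imblearn.pipeline.Pipeline (import not shown);
# sklearn's own Pipeline does not accept samplers such as SMOTE.
smt = SMOTE(random_state=42)
rf = pl1([('preprocessor', preprocessor), ('smote', smt),
          ('classifier', RandomForestClassifier())])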
I am fitting the pipeline on imbalanced data, so I have included SMOTE alongside the preprocessing and the classifier. Because the data is imbalanced, I want to evaluate the recall score.
Is the approach shown in the code below correct? I am getting a recall of around 0.98, which makes me suspect that the model is overfitting. Any suggestions if I am making a mistake?
scores = cross_val_score(rf, X, y, cv=5, scoring="recall")