
I am trying to use the sklearn Pipeline API to preprocess data before training multiple ML models.

This is my code for the pipeline:

    def pipeline(self):
        self.numerical_features = self.X_train.select_dtypes(include='number').columns.tolist()
        print(f'There are {len(self.numerical_features)} numerical features:', '\n')
        print(self.numerical_features)
        self.categorical_features = self.X_train.select_dtypes(exclude='number').columns.tolist()
        print(f'There are {len(self.categorical_features)} categorical features:', '\n')
        print(self.categorical_features)
        # The following pipelines impute missing values and scale X_train
        self.numeric_pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scale', MinMaxScaler())
        ])
        self.categorical_pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
        ])
        try:
            self.full_processor = ColumnTransformer(transformers=[
                ('number', self.numeric_pipeline, self.numerical_features),
                ('category', self.categorical_pipeline, self.categorical_features)
            ])
            print(self.full_processor.fit_transform(self.X_train))
        except Exception as e:
            print(f"Error occurred, check pipeline: {e}")

    def lasso_estimator(self):
        self.lasso = Lasso(alpha=0.1)
        self.lasso_pipeline = Pipeline(steps=[
            ('preprocess', self.full_processor),
            ('model', self.lasso)
        ])
        try:
            self.model_fit = self.lasso_pipeline.fit(self.X_train, self.y_train)
            self.y_pred = self.model_fit.predict(self.X_test)
            self.mae = round(mean_absolute_error(self.y_test, self.y_pred), 3)
            print(f'Lasso Regression - MAE: {self.mae}')
            return self.lasso_pipeline
        except ValueError:
            print("Error occurred while training the lasso model")

    def rf_estimator(self):
        self.rf_model = RandomForestClassifier()
        self.rf_pipeline = Pipeline(steps=[
            ('preprocess', self.full_processor),
            ('model', self.rf_model)
        ])
        print(self.rf_pipeline)
        self.rf_model_fit = self.rf_pipeline.fit(self.X_train, self.y_train)
        self.y_pred = self.rf_model_fit.predict(self.X_test)
        # get feature names and importances
        print(self.rf_pipeline[:-1].get_feature_names_out())
        print(self.rf_model_fit[-1].feature_importances_)

I have 8 numerical features and one categorical feature in my X_train data. I found that the categorical feature contains the character '?'. I tried to replace this character with the mean before using the Pipeline.
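Since '?' marks missing values rather than a genuine category, one fix (a minimal sketch, assuming X_train is a pandas DataFrame and using the "Bare Nuclei" column name from the output below) is to coerce that column to numeric before building the pipeline, so `select_dtypes` routes it through the numeric branch instead of the one-hot branch:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; '?' marks a missing entry
df = pd.DataFrame({"Bare Nuclei": ["1", "10", "?", "3"]})

# Coerce to numeric: '?' becomes NaN, the dtype becomes float64,
# so SimpleImputer(strategy='mean') can handle it downstream
df["Bare Nuclei"] = pd.to_numeric(df["Bare Nuclei"], errors="coerce")
print(df["Bare Nuclei"].dtype)
```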

When I trained with RandomForest and printed the feature importances, it seems that OneHotEncoder is not doing what I want, because it split my single categorical feature into 11 one-hot columns (one per category, including '?'):

                                features  importance
0                number__Clump Thickness    0.077595
1        number__Uniformity of Cell Size    0.209922
2       number__Uniformity of Cell Shape    0.238910
3              number__Marginal Adhesion    0.036221
4   number__ Single Epithelial Cell Size    0.097657
5                number__Bland Chromatin    0.118026
6                number__Normal Nucleoli    0.078073
7                        number__Mitoses    0.015312
8                category__Bare Nuclei_1    0.060222
9               category__Bare Nuclei_10    0.036725
10               category__Bare Nuclei_2    0.002806
11               category__Bare Nuclei_3    0.001509
12               category__Bare Nuclei_4    0.003297
13               category__Bare Nuclei_5    0.004999
14               category__Bare Nuclei_6    0.002179
15               category__Bare Nuclei_7    0.003448
16               category__Bare Nuclei_8    0.002842
17               category__Bare Nuclei_9    0.001375
18               category__Bare Nuclei_?    0.008881

This leaves me with 19 features instead of 9.

How can I get rid of this categorical conversion problem?

  • Are you asking why OHE creates more features (that's what it's supposed to do), or what's happening with the `?` category (answered below), or something else? – Ben Reiniger Sep 25 '22 at 15:10

1 Answer


The default missing value in SimpleImputer is np.nan. However, your missing values are represented by '?'. You can change the default by setting the missing_values argument, like this:

SimpleImputer(missing_values='?', strategy='mean')
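For example (a small self-contained sketch with made-up data): on a string-typed column, imputing '?' works when the strategy doesn't require arithmetic, e.g. 'most_frequent' — whereas 'mean' needs numeric data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy object-dtype column where '?' marks a missing entry
X = np.array([["3"], ["1"], ["?"], ["3"]], dtype=object)

imputer = SimpleImputer(missing_values="?", strategy="most_frequent")
print(imputer.fit_transform(X))  # '?' is replaced by the mode, '3'
```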
  • Hi, thank you for your reply. It still gives me this error: could not convert string to float: '?'. – Codeholic Sep 24 '22 at 21:41
  • It works if I normalize and replace my original X with 0, the mean, or the median. But it doesn't work if I normalize in the categorical pipeline. Any idea how to fix this issue? – Codeholic Sep 24 '22 at 22:03
  • Yes, that's true. `SimpleImputer` can't handle strings when it needs to compute statistics such as `'mean'` or `'median'`, but it does work with `'most_frequent'` and `strategy='constant'` (which is not the case here). – Mohammad Tehrani Sep 25 '22 at 18:10
  • Thanks. I guess I will have to do more preprocessing before I use the Pipeline. – Codeholic Sep 26 '22 at 02:16