
I am continuing to investigate pipelines. My aim is to execute every step of the machine learning workflow inside a single pipeline, which makes it more flexible and easier to adapt to another use case. So here is what I do:

  • Step 1: Fill NaN Values
  • Step 2: Transforming Categorical Values into Numbers
  • Step 3: Classifier
  • Step 4: GridSearch
  • Step 5: Add metrics (failed)

Here is my code:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score


class FillNa(BaseEstimator, TransformerMixin):
    """Fill NaN values: most frequent value for non-numeric columns, mean otherwise."""

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        for column in x.columns:
            if column in non_numerics_columns:
                # Non-numeric column: fill with the most frequent value
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].value_counts().idxmax())
            else:
                # Numeric column: fill with the mean
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].mean())
        return x

    def fit(self, x, y=None):
        return self


class CategoricalToNumerical(BaseEstimator, TransformerMixin):
    """Label-encode every non-numeric column into integer codes."""

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            # Fill any remaining NaNs with the most frequent value before encoding
            x.loc[:, column] = x.loc[:, column].fillna(
                x.loc[:, column].value_counts().idxmax())
            le.fit(x.loc[:, column])
            x.loc[:, column] = le.transform(x.loc[:, column]).astype(int)
        return x

    def fit(self, x, y=None):
        return self


class Perf(BaseEstimator, TransformerMixin):

    def fit(self, clf, x, y, perf="all"):
        """Only for classifier model.

        Return AUC, ROC, Confusion Matrix and F1 score from a classifier and df
        You can put a list of eval instead a string for eval paramater.
        Example: eval=['all', 'auc', 'roc', 'cm', 'f1'] will return these 4
        evals.
        """
        evals = {}
        y_pred_proba = clf.predict_proba(x)[:, 1]
        y_pred = clf.predict(x)
        perf_list = perf.split(',')
        if ("all" or "roc") in perf.split(','):
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            roc_auc = round(auc(fpr, tpr), 3)
            plt.style.use('bmh')
            plt.figure(figsize=(12, 9))
            plt.title('ROC Curve')
            plt.plot(fpr, tpr, 'b',
                     label='AUC = {}'.format(roc_auc))
            plt.legend(loc='lower right', borderpad=1, labelspacing=1,
                       prop={"size": 12}, facecolor='white')
            plt.plot([0, 1], [0, 1], 'r--')
            plt.xlim([-0.1, 1.])
            plt.ylim([-0.1, 1.])
            plt.ylabel('True Positive Rate')
            plt.xlabel('False Positive Rate')
            plt.show()

        if "all" in perf_list or "auc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            evals['auc'] = auc(fpr, tpr)

        if "all" in perf_list or "cm" in perf_list:
            evals['cm'] = confusion_matrix(y, y_pred)

        if "all" in perf_list or "f1" in perf_list:
            evals['f1'] = f1_score(y, y_pred)

        return evals


path = '~/proj/akd-doc/notebooks/data/'
df = pd.read_csv(path + 'titanic_tuto.csv', sep=';')
y = df.pop('Survival-Status').replace(to_replace=['dead', 'alive'],
                                      value=[0., 1.])
X = df.copy()
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), test_size=0.2, random_state=42)

percent = 0.50
nb_features = round(percent * df.shape[1]) + 1
clf = RandomForestClassifier()
pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf),
                     ('perf', Perf())])

params = dict(random_forest__max_depth=list(range(8, 12)),
              random_forest__n_estimators=list(range(30, 110, 10)))
cv = GridSearchCV(pipeline, param_grid=params)
cv.fit(X_train, y_train)

I am aware that it is not ideal to plot a ROC curve inside the pipeline, but that's not the problem right now.

When I execute this code, I get:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('fillna', FillNa()), ('categorical_to_numerical', CategoricalToNumerical()), ('features_selection', SelectKBest(k=10, score_func=<function f_classif at 0x7f4ed4c3eae8>)), ('random_forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None,...=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)), ('perf', Perf())]) does not.

I'm interested in all ideas...


1 Answer


As the error states, you need to specify the scoring parameter in GridSearchCV.

Use

GridSearchCV(pipeline, param_grid=params, scoring='accuracy')
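
If you would rather have the grid search rank parameter combinations by a different metric, any of scikit-learn's built-in scorer names (for example 'roc_auc' or 'f1') can be passed instead of 'accuracy'. A minimal sketch:

# Sketch: let the grid search select parameters by ROC AUC instead of accuracy
cv = GridSearchCV(pipeline, param_grid=params, scoring='roc_auc')
cv.fit(X_train, y_train)
print(cv.best_params_, cv.best_score_)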

Edit (Based on questions in comments):

If you need the ROC curve, AUC and F1 score for the entire X_train and y_train (and not for each split made by GridSearchCV), it's better to keep the Perf class out of the pipeline.

pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf)])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

performance_meas = Perf()
performance_meas.fit(pipeline, X_train, y_train)
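
For example, using your own Perf class, you can request only the metrics you need (skipping the ROC plot) and read them from the returned dictionary:

# Sketch: compute only AUC, F1 and the confusion matrix on the training data
evals = performance_meas.fit(pipeline, X_train, y_train, perf="auc,f1,cm")
print(evals['auc'], evals['f1'])
print(evals['cm'])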
  • Great ! But it's not possible to plot my roc curve in this way ?! And it will be possible to get accuracy and f1 score in the same pipeline ? – Jeremie Guez May 04 '17 at 15:56
  • Yes its possible. Are you not getting the results? Upon further inspection of your code, it seems like it will give another error even after solving this one. – Vivek Kumar May 04 '17 at 16:01
  • If i delete my `Class Perf` and call `cv = GridSearchCV(pipeline, param_grid=params, scoring = 'accuracy') cv.fit(X_train, y_train)` I don't have any errors. I am trying to find a way to get roc, auc, f1_score with the same run – Jeremie Guez May 04 '17 at 16:05
  • I did not understand. You can get any score metric (f1, accuracy, recall), but the question is what do you want to use with GridSearchCV.? – Vivek Kumar May 04 '17 at 16:07
  • See, when used the Perf inside the pipeline, along with GridSearchCV, it means that you want the scores for all splits that the GridSearchCV will do on the data. If you want to access all these scores for all your data, its better to keep it out of pipeline. Did you get my point? – Vivek Kumar May 04 '17 at 16:09
  • Yes I understand but I don't want the scores for all splits. I want to plot a roc curve, get auc score and f1_score – Jeremie Guez May 04 '17 at 16:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/143419/discussion-between-jeremie-guez-and-vivek-kumar). – Jeremie Guez May 04 '17 at 16:13