Pipeline and GridSearchCV, and Multi-Class challenge for XGBoost and RandomForest

Question

I am working on workflows using Pipeline and GridSearchCV.

My MWE for RandomForest is as below:

#################################################################
# Libraries
#################################################################
import time
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

#################################################################
# Data loading and Symlinks
#################################################################
train = pd.read_csv("data_train.csv")
test = pd.read_csv("data_test.csv")

#################################################################
# Train Test Split
#################################################################
# Selected features - Training data
X = train.drop(columns='fault_severity')

# Training data
y = train.fault_severity

# Test data
x = test

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#################################################################
# Pipeline
#################################################################
pipe_rf = Pipeline([
    ('clf', RandomForestClassifier(random_state=0))
    ])

parameters_rf = {
        'clf__n_estimators':[30,40], 
        'clf__criterion':['entropy'], 
        'clf__min_samples_split':[15,20], 
        'clf__min_samples_leaf':[3,4]
    }

grid_rf = GridSearchCV(pipe_rf,
    param_grid=parameters_rf,
    scoring='neg_mean_absolute_error',
    cv=5,
    refit=True) 

#################################################################
# Modeling
#################################################################
start_time = time.time()

grid_rf.fit(X_train, y_train)

#Calculate the score once and use when needed
mae = grid_rf.score(X_valid,y_valid)

print("Best params                        : %s" % grid_rf.best_params_)
print("Best training data MAE score       : %s" % grid_rf.best_score_)    
print("Best validation data MAE score (*) : %s" % mae)
print("Modeling time                      : %s" % time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

#################################################################
# Prediction
#################################################################
#Predict using the test data with selected features
y_pred = grid_rf.predict(x)

# Transform numpy array to dataframe
y_pred = pd.DataFrame(y_pred)

# Rearrange dataframe
y_pred.columns = ['prediction']
y_pred.insert(0, 'id', x['id'])

# Save to CSV
y_pred.to_csv("data_predict.csv", index = False, header=True)
#Output
# id,prediction
# 11066,0
# 18000,2
# 16964,0
# ...., ....

I have an MWE for XGBoost, as below:

#################################################################
# Libraries
#################################################################
import time
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#################################################################
# Data loading and Symlinks
#################################################################
train = pd.read_csv("data_train.csv")
test = pd.read_csv("data_test.csv")

#################################################################
# Train Test Split
#################################################################

# Selected features - Training data
X = train.drop(columns='fault_severity')

# Training data
y = train.fault_severity

# Test data
x = test

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#################################################################
# DMatrix
#################################################################
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=test)

params = {
    'max_depth': 6,
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3,
    'n_gpus': 0
}

#################################################################
# Modeling
#################################################################
start_time = time.time()
bst = xgb.train(params, dtrain)

#################################################################
# Prediction
#################################################################
#Predict using the test data with selected features
y_pred = bst.predict(dtest)

# Transform numpy array to dataframe
y_pred = pd.DataFrame(y_pred)

# Rearrange dataframe
y_pred.columns = ['prediction_0', 'prediction_1', 'prediction_2']
y_pred.insert(0, 'id', x['id'])

# Save to CSV
y_pred.to_csv("data_predict_xgb.csv", index = False, header=True)
# Expected Output:
# id,prediction_0,prediction_1,prediction_2
# 11066,0.4674369,0.46609518,0.06646795
# 18000,0.7578633,0.19379888,0.048337903
# 16964,0.9296321,0.04505246,0.025315404
# ...., ...., ...., ....

Questions:

How can the XGBoost MWE be converted to use Pipeline and GridSearchCV, as in the RandomForest MWE? I have to set 'num_class', which XGBRegressor() does not support.
How can I get multi-class prediction output from RandomForest in the same form as XGBoost (i.e. prediction_0, prediction_1, prediction_2)? Sample outputs are given in the MWEs above. I found that num_class is not supported by RandomForestClassifier.

I have spent several days working on this and am still blocked. I would appreciate some pointers to move forward.

Data:

data_train: https://www.dropbox.com/s/bnomyoidkcgyb2y/data_train.csv
data_test: https://www.dropbox.com/s/kn1bgde3hsf6ngy/data_test.csv

Chris · Accepted Answer · 2020-04-01T09:02:35.040

I presume in your first question, you did not mean to refer to XGBRegressor.

In order to allow an XGBClassifier to run in the pipeline, you simply need to change the initial definition of the pipeline:

params = {
    'max_depth': 6,
    'objective': 'multi:softprob',
    'num_class': 3,
    'n_gpus': 0
}
pipe_xgb = Pipeline([
    ('clf', xgb.XGBClassifier(**params))
])

(Note: I've changed the pipeline name to pipe_xgb, so you would need to change this in the rest of your code.)

As the answer to this question explains, XGBClassifier automatically switches to multiclass classification when the target variable has more than two classes, so you do not need to set num_class yourself; it is inferred from the labels during fit.

You should also switch to a classification metric: each of your examples uses MAE, which is a regression metric.
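
To see why the choice of metric matters, here is a minimal sketch on made-up labels (the arrays are illustrative, not taken from your dataset). MAE treats the class labels as numbers, so confusing class 2 with class 0 costs twice as much as confusing class 1 with class 2, whereas accuracy only counts whether each label is right:

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up labels for a three-class problem (illustrative only)
y_true = [0, 1, 2, 2, 0, 1]
y_hat = [0, 2, 2, 0, 0, 1]   # two mistakes: 1 -> 2 and 2 -> 0

acc = accuracy_score(y_true, y_hat)        # fraction of exact matches: 4 of 6
mae = mean_absolute_error(y_true, y_hat)   # (1 + 2) / 6 = 0.5

print("accuracy:", acc)
print("MAE     :", mae)
```

If the severity levels are meant to be ordered, MAE can still be informative, but accuracy (or another classification scorer such as 'f1_macro') is the more usual choice for scoring= when GridSearchCV wraps a classifier.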

Here's a complete example of your code, using XGBClassifier with accuracy as the metric:

#################################################################
# Libraries
#################################################################
import time
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import xgboost as xgb

#################################################################
# Data loading and Symlinks
#################################################################
train = pd.read_csv("https://dl.dropbox.com/s/bnomyoidkcgyb2y/data_train.csv?dl=0")
test = pd.read_csv("https://dl.dropbox.com/s/kn1bgde3hsf6ngy/data_test.csv?dl=0")

#################################################################
# Train Test Split
#################################################################
# Selected features - Training data
X = train.drop(columns='fault_severity')

# Training data
y = train.fault_severity

# Test data
x = test

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)


#################################################################
# Pipeline
#################################################################
params = {
    'max_depth': 6,
    'objective': 'multi:softprob',  # multiclass objective; predicts per-class probabilities
    'num_class': 3,
    'n_gpus': 0
}
pipe_xgb = Pipeline([
    ('clf', xgb.XGBClassifier(**params))
    ])

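# Search over hyperparameters that XGBClassifier actually uses; the RandomForest-only
# options (criterion, min_samples_split, min_samples_leaf) do not apply to XGBoost.
# The candidate values below are illustrative.
parameters_xgb = {
        'clf__n_estimators':[30,40], 
        'clf__max_depth':[4,6], 
        'clf__learning_rate':[0.1,0.3], 
        'clf__min_child_weight':[1,3]
    }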

grid_xgb = GridSearchCV(pipe_xgb,
    param_grid=parameters_xgb,
    scoring='accuracy',
    cv=5,
    refit=True)

#################################################################
# Modeling
#################################################################
start_time = time.time()

grid_xgb.fit(X_train, y_train)

#Calculate the score once and use when needed
acc = grid_xgb.score(X_valid,y_valid)

print("Best params                        : %s" % grid_xgb.best_params_)
print("Best cross-validation accuracy     : %s" % grid_xgb.best_score_)    
print("Best validation data accuracy (*)  : %s" % acc)
print("Modeling time                      : %s" % time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

#################################################################
# Prediction
#################################################################
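#Predict on the validation data to check the accuracy
y_pred = grid_xgb.predict(X_valid)

# Transform numpy array to dataframe
y_pred = pd.DataFrame(y_pred)

# Rearrange dataframe
y_pred.columns = ['prediction']
# Take the ids from the validation rows (not from the test set x) so they match the predictions
y_pred.insert(0, 'id', X_valid['id'].values)
print("Validation accuracy                : %s" % accuracy_score(y_valid, y_pred.prediction))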

Edit to address additional question in a comment.

You can use the predict_proba method of xgb's sklearn API to get probabilities for each class:

# Predict on the test data x so that each row lines up with x['id']
y_pred = pd.DataFrame(grid_xgb.predict_proba(x),
                      columns=['prediction_0', 'prediction_1', 'prediction_2'])
y_pred.insert(0, 'id', x['id'])

With the above code, y_pred has the following format:

      id  prediction_0  prediction_1  prediction_2
0  11066      0.490955      0.436085      0.072961
1  18000      0.718351      0.236274      0.045375
2  16964      0.920252      0.052558      0.027190
3   4795      0.958216      0.021558      0.020226
4   3392      0.306204      0.155550      0.538246
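
As a rough illustration of how the two outputs relate, the sketch below uses scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in (an assumption, so that the example runs without XGBoost or the original CSV files). The hard labels returned by predict correspond to the class with the highest predict_proba value in each row:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic three-class data (stand-in for the real training set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

# Stand-in boosted classifier; this is not XGBoost itself
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One column of probabilities per class, in the order given by clf.classes_
proba = clf.predict_proba(X_demo)

# Pick the most probable class in each row
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba[:3])
print(np.array_equal(labels_from_proba, clf.predict(X_demo)))
```

This is why predict gave a single prediction column while predict_proba gives one column per class.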

Thanks. Learnt new things from you especially **params method. The expected output for y_pred as in my XGB MWE is #id,prediction_0,prediction_1,prediction_2 #11066,0.4674369,0.46609518,0.06646795 # 18000,0.7578633,0.19379888,0.048337903 .But got #id,prediction #11066,0 #18000,0 Is there a way to get it in the expected format. — Saravanan K, Apr 01 '20 at 08:29
@SaravananK Glad to be able to help. I've added an update to the answer – does that do the trick? — Chris, Apr 01 '20 at 09:09
It works like a charm. Studied further on predict_proba API for XGBoost. https://xgboost.readthedocs.io/en/latest/python/python_api.html Hunting for similar for RandomForestClassifier. 3am already, will continue research on this tomorrow..I mean later today. You have been very helpful. Thanks :-) — Saravanan K, Apr 01 '20 at 10:01
My pleasure. `RandomForestClassifier` also has a `predict_proba` method, so you should just be able to call it from your first example. See the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba). — Chris, Apr 01 '20 at 10:04
I have studied and tested your proposal. Again, it's flawless. Thanks — Saravanan K, Apr 01 '20 at 19:16
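
Following up on the last comments, here is a minimal sketch of the RandomForest version with per-class probabilities. It runs on synthetic data from make_classification rather than data_train.csv (an assumption, to keep it self-contained), and the small parameter grid is illustrative. The column names are built from the fitted classifier's classes_ attribute, since predict_proba returns its columns in that order:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic three-class data standing in for data_train.csv (assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

pipe_rf = Pipeline([('clf', RandomForestClassifier(random_state=0))])
grid_rf = GridSearchCV(pipe_rf,
                       param_grid={'clf__n_estimators': [30, 40],
                                   'clf__min_samples_leaf': [3, 4]},
                       scoring='accuracy',
                       cv=3,
                       refit=True)
grid_rf.fit(X_tr, y_tr)

# Column order of predict_proba follows classes_ of the fitted classifier
classes = grid_rf.best_estimator_.named_steps['clf'].classes_
y_proba = pd.DataFrame(grid_rf.predict_proba(X_va),
                       columns=['prediction_%s' % c for c in classes])
print(y_proba.head())
```

With the real data, the same frame can be given an id column and written out with to_csv, as in the question's MWE.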
