
I want a process that gives me, as a result, a list of machine learning models and their accuracy scores, but only for the set of params that gives the best result for each type of model.

As an example, here is just the CV for XGBoost:

dataset:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])

from sklearn.model_selection import train_test_split
X = data.drop(['target'], axis=1)
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

function for finding best params:

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer
accu = make_scorer(accuracy_score) # I will be using f1 in future

def predict_for_best_params(alg, X_train, y_train, X_test):
    params = {'n_estimators': [200, 300, 500]}
    clf = GridSearchCV(alg, params, scoring = accu, cv=2)
    clf.fit(X_train, y_train)
    print(clf.best_estimator_)
    y_pred = clf.predict(X_test)
    return y_pred

using it on one model:

from xgboost import XGBClassifier
alg = [XGBClassifier()]
y_pred = predict_for_best_params(alg[0], X_train, y_train, X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

What I want to achieve is something like:

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

alg = [XGBClassifier(), RandomForestClassifier()]  # list of many of them

alg_params = {'XGBClassifier': [{'n_estimators': [200, 300, 500]}],
              'RandomForestClassifier': [{'max_depth': [1, 2, 3, 4]}]}

def predict_for_best_params(alg, X_train, y_train, X_test, params):
    clf = GridSearchCV(alg, params, scoring = accu, cv=2)
    clf.fit(X_train, y_train)
    print(clf.best_estimator_)
    y_pred = clf.predict(X_test)
    return y_pred

for algo in alg:
    params = alg_params[str(algo)][0]  # this won't work: str(algo) is not e.g. 'XGBClassifier' but the full repr, 'XGBClassifier(...all default params...)'
    y_pred = predict_for_best_params(algo, X_train, y_train, X_test, params)
    print('{} accuracy is: {}'.format(algo, accuracy_score(y_test, y_pred)))

Is this a good way to achieve it?

  • Please refrain from asking for "the best way of [doing something]" - asking how to do it should be more than enough... – desertnaut Oct 31 '18 at 13:38

1 Answer


If you are only worried about how to construct the key, then you can use:

params = alg_params[alg.__class__.__name__][0] 

This returns only the class name of the alg object (e.g. 'XGBClassifier'), which matches the keys in alg_params.
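
As a minimal sketch, reusing alg, alg_params and predict_for_best_params from the question (the keys in alg_params must match the class names exactly):

for algo in alg:
    # look up the grid by class name, e.g. 'XGBClassifier'
    params = alg_params[algo.__class__.__name__][0]
    y_pred = predict_for_best_params(algo, X_train, y_train, X_test, params)
    print('{} accuracy is: {}'.format(algo.__class__.__name__,
                                      accuracy_score(y_test, y_pred)))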

For an alternative approach, you can look at my other answer:

That answer makes use of the fact that GridSearchCV can take a list of dicts of parameter grids, in which each dict is expanded separately (see the sketch after the list below). But note the following:

  • This can be faster than your current for-loop if you set n_jobs > 1 (to use multiprocessing).
  • You can then use the cv_results_ attribute of the completed GridSearchCV to analyse the scores.
  • To calculate y_pred for individual estimators, you can filter cv_results_ (maybe by loading it into a pandas DataFrame), refit the estimator with the best found parameters, and then calculate y_pred. This should be pretty easy.
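
A minimal sketch of that approach, using a single-step Pipeline so the estimator itself becomes a searchable parameter (the step name 'clf' is arbitrary, and the parameter values are the illustrative ones from the question):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import pandas as pd

accu = make_scorer(accuracy_score)

# Single-step pipeline; the 'clf' step is a placeholder estimator.
pipe = Pipeline([('clf', XGBClassifier())])

# List of dicts: each dict is a self-contained grid for one estimator.
param_grid = [
    {'clf': [XGBClassifier()], 'clf__n_estimators': [200, 300, 500]},
    {'clf': [RandomForestClassifier()], 'clf__max_depth': [1, 2, 3, 4]},
]

search = GridSearchCV(pipe, param_grid, scoring=accu, cv=2, n_jobs=-1)
search.fit(X_train, y_train)

# All tried combinations and their scores, one row per combination.
results = pd.DataFrame(search.cv_results_)
print(results[['param_clf', 'mean_test_score']])

# search.best_estimator_ is already refit on the full training data,
# so this predicts with the overall best estimator/parameter combination.
y_pred = search.predict(X_test)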
  • Hi. I think your answer(s) will be helpful for my question also - https://stackoverflow.com/questions/55468376/how-do-i-change-using-for-loops-to-call-multiple-functions-into-using-a-pi - Can you please have a look at it? I think my trouble right now is the `y_pred` part. – scientific_explorer Apr 03 '19 at 11:49