
I used train_test_split (random_state = 0) and a decision tree without any parameter tuning to model my data, and I ran it about 50 times to achieve the best accuracy.

import pandas as pd
import numpy as np

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the first sheet of the spreadsheet into a DataFrame
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

# No fixed random_state here, so every run produces a different split
train, test = train_test_split(data, test_size=0.15)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

c = DecisionTreeClassifier()

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]

x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

c.fit(x_train, y_train)

y_pred = c.predict(x_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)*100

print ("Accuracy using Decision Tree:", round(score, 1), "%")

In the second step, I decided to use the GridSearchCV approach to tune the tree parameters (the code below actually uses RandomizedSearchCV to sample from a parameter distribution).

import pandas as pd
import numpy as np

from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# Load the first sheet of the spreadsheet into a DataFrame
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

train, test = train_test_split(data, test_size = 0.15, random_state = 0)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]

x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

# max_depth from a fixed list; min_samples_leaf drawn uniformly from [10, 60)
param_dist = {"max_depth": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
              "min_samples_leaf": randint(10, 60)}

tree = DecisionTreeClassifier()
# RandomizedSearchCV samples n_iter settings from param_dist (default n_iter=10)
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(x_train, y_train)

print("Tuned Decisio Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is: {}".format(tree_cv.best_score_))

y_pred = tree_cv.predict(x_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)*100

print ("Accuracy using Decision Tree:", round(score, 1), "%")

My best accuracy with the first method is much better than with the GridSearchCV method.

Why is this happening?

Do you know the best way to get the best tree with the most accuracy?

mina

2 Answers


Why is this happening?

Without seeing more detail I can only speculate. It is probably down to the granularity of your grid: if you sample only 50 combinations out of a space with billions of possible combinations, the search covers it too sparsely to be meaningful. Is there a way you can narrow down which parameters you're searching over?
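
For example, even the modest space in your question is undersampled by a default randomized search; a quick sketch using the param_dist from the question:

# 11 max_depth values times 50 possible min_samples_leaf values
# (randint(10, 60)) gives 550 combinations in total
max_depth_options = 11
min_samples_leaf_options = 50
total = max_depth_options * min_samples_leaf_options
print(total)       # 550
print(10 / total)  # RandomizedSearchCV's default n_iter=10 covers under 2%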

Do you know the best way to get the best tree with the most accuracy?

This is a hard question, because you first need to define what you mean by accuracy; you can always build a model which overfits your test data. Technically, the way to get the best tree is to try every possible combination of your hyperparameters, but for any reasonable number of parameters this will take forever. Generally your best option is a Bayesian approach to searching the hyperparameter space, which maintains a distribution over each of your parameters. My advice would be to start with a random search (RandomizedSearchCV) rather than a grid search. If you are a big fan of skopt, you can use its BayesSearchCV; I recommend reading the code, as I believe it is poorly documented.

import pandas as pd
import numpy as np
import xgboost as xgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 10 # 1000
TRAINING_SIZE = 100000 # 20000000
TEST_SIZE = 25000

# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator = xgb.XGBClassifier(
        n_jobs = 1,
        objective = 'binary:logistic',
        eval_metric = 'auc',
        silent=1,
        tree_method='approx'
    ),
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (0, 50),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'min_child_weight': (0, 5),
        'n_estimators': (50, 100),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },    
    scoring = 'roc_auc',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=42
    ),
    n_jobs = 3,
    n_iter = ITERATIONS,   
    verbose = 0,
    refit = True,
    random_state = 42
)

# X and y are assumed to be your feature matrix and target as pandas objects
result = bayes_cv_tuner.fit(X.values, y.values)
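
After fitting, the usual scikit-learn search attributes are available on the result, since BayesSearchCV follows the common *SearchCV interface:

print(result.best_params_)            # best hyperparameter combination found
print(result.best_score_)             # its mean cross-validated AUC
best_model = result.best_estimator_   # refitted on the full data (refit=True)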

Skopt: https://scikit-optimize.github.io/

Code: https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/searchcv.py

Violatic

It depends on the parameter ranges you specify for GridSearchCV.

A DecisionTreeClassifier built without any arguments uses the default parameter values, which are not within the ranges you manually specified: the default max_depth is None (unbounded) and the default min_samples_leaf is 1, while your search only covers max_depth 10-20 and min_samples_leaf 10-59. Choose a better set of parameter ranges and try GridSearchCV again.
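
You can verify this by inspecting the defaults; a quick sketch:

from sklearn.tree import DecisionTreeClassifier

# Defaults used by the untuned tree in the first method
params = DecisionTreeClassifier().get_params()
print(params["max_depth"])         # None -> tree depth is unbounded
print(params["min_samples_leaf"])  # 1   -> well below the searched 10-59 range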

amalik2205
    The best tree parameters in the first method are included in the range of parameters of the second method. @amalik2205 – mina Jul 12 '19 at 13:24