
I used train_test_split (random_state = 0) and a decision tree without any parameter tuning to model my data, and I ran it about 50 times to achieve the best accuracy.

import pandas as pd
import numpy as np

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the first sheet of the spreadsheet into a DataFrame
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

# No fixed random_state here, so every run produces a different split
train, test = train_test_split(data, test_size=0.15)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

c = DecisionTreeClassifier()

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]

x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

c.fit(x_train, y_train)

y_pred = c.predict(x_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)*100

print ("Accuracy using Decision Tree:", round(score, 1), "%")

In the second step, I decided to use the GridSearchCV approach to tune the tree parameters (the code below actually uses RandomizedSearchCV to sample from a parameter distribution).

import pandas as pd
import numpy as np

from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# Load the first sheet of the spreadsheet into a DataFrame
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

train, test = train_test_split(data, test_size = 0.15, random_state = 0)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]

x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

# max_depth from a fixed list; min_samples_leaf drawn uniformly from [10, 60)
param_dist = {"max_depth": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
              "min_samples_leaf": randint(10, 60)}

tree = DecisionTreeClassifier()
# RandomizedSearchCV samples n_iter settings from param_dist (default n_iter=10)
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(x_train, y_train)

print("Tuned Decisio Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is: {}".format(tree_cv.best_score_))

y_pred = tree_cv.predict(x_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)*100

print ("Accuracy using Decision Tree:", round(score, 1), "%")

My best accuracy with the first method is much better than with the GridSearchCV method.

Why is this happening?

Do you know the best way to get the best tree with the most accuracy?

mina

2 Answers


Why is this happening?

Without seeing more detail I can only speculate. It is probably down to the granularity of your grid: if you sample only 50 combinations out of a space with billions of possible combinations, the search covers it too sparsely to be meaningful. Is there a way you can narrow down which parameters you're searching over?
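
For example, even the modest space in your question is undersampled by a default randomized search; a quick sketch using the param_dist from the question:

# 11 max_depth values times 50 possible min_samples_leaf values
# (randint(10, 60)) gives 550 combinations in total
max_depth_options = 11
min_samples_leaf_options = 50
total = max_depth_options * min_samples_leaf_options
print(total)       # 550
print(10 / total)  # RandomizedSearchCV's default n_iter=10 covers under 2%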

Do you know the best way to get the best tree with the most accuracy?

This is a hard question, because you first need to define what you mean by accuracy; you can always build a model which overfits your test data. Technically, the way to get the best tree is to try every possible combination of your hyperparameters, but for any reasonable number of parameters this will take forever. Generally your best option is a Bayesian approach to searching the hyperparameter space, which maintains a distribution over each of your parameters. My advice would be to start with a random search (RandomizedSearchCV) rather than a grid search. If you are a big fan of skopt, you can use its BayesSearchCV; I recommend reading the code, as I believe it is poorly documented.

import pandas as pd
import numpy as np
import xgboost as xgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 10 # 1000
TRAINING_SIZE = 100000 # 20000000
TEST_SIZE = 25000

# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator = xgb.XGBClassifier(
        n_jobs = 1,
        objective = 'binary:logistic',
        eval_metric = 'auc',
        silent=1,
        tree_method='approx'
    ),
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_depth': (0, 50),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'min_child_weight': (0, 5),
        'n_estimators': (50, 100),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },    
    scoring = 'roc_auc',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=42
    ),
    n_jobs = 3,
    n_iter = ITERATIONS,   
    verbose = 0,
    refit = True,
    random_state = 42
)

# X and y are assumed to be your feature matrix and target as pandas objects
result = bayes_cv_tuner.fit(X.values, y.values)
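
After fitting, the usual scikit-learn search attributes are available on the result, since BayesSearchCV follows the common *SearchCV interface:

print(result.best_params_)            # best hyperparameter combination found
print(result.best_score_)             # its mean cross-validated AUC
best_model = result.best_estimator_   # refitted on the full data (refit=True)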

Skopt: https://scikit-optimize.github.io/

Code: https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/searchcv.py

Violatic

It depends on the parameter ranges you specify for GridSearchCV.

A DecisionTreeClassifier built without any arguments uses the default parameter values, which are not within the ranges you manually specified: the default max_depth is None (unbounded) and the default min_samples_leaf is 1, while your search only covers max_depth 10-20 and min_samples_leaf 10-59. Choose a better set of parameter ranges and try GridSearchCV again.
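
You can verify this by inspecting the defaults; a quick sketch:

from sklearn.tree import DecisionTreeClassifier

# Defaults used by the untuned tree in the first method
params = DecisionTreeClassifier().get_params()
print(params["max_depth"])         # None -> tree depth is unbounded
print(params["min_samples_leaf"])  # 1   -> well below the searched 10-59 range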

amalik2205
    The best tree parameters in the first method are included in the range of parameters of the second method. @amalik2205 – mina Jul 12 '19 at 13:24