Decision tree too big Scikit Learn

Question

I have a dataset with 1025 rows and 14 columns. First I separate the label from the features:

x = dataset.drop('label', axis=1)
y = dataset['label']

The label is always either 1 or 0. Then I split the data using:

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)

I then make my Classifier:

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

Then whenever I make my Decision tree, it ends up too big:

from sklearn import tree
tree.plot_tree(classifier.fit(X_train, y_train))

The result outputs 8 levels and it gets too big. I thought this was okay but after observing the confusion matrix and classification report:

from sklearn.metrics import classification_report, confusion_matrix
y_pred = classifier.predict(X_test)  # assumed: predictions on the held-out test set
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

It results in:

[[155   3]
 [  3 147]]
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       158
           1       0.98      0.98      0.98       150

    accuracy                           0.98       308
   macro avg       0.98      0.98      0.98       308
weighted avg       0.98      0.98      0.98       308

The high accuracy makes me doubt my solution. What is wrong with my code and how can I tone down the decision tree and accuracy score?

Can you define what _too big_ means? Why do you want your decision tree to be inaccurate? — artemis, Dec 12 '19 at 14:28
The tree has 95 nodes and 8 levels. I think its branching out too much fails to generalize its decisions — qrnl, Dec 12 '19 at 14:31
So are you looking for ways to prevent _overfitting_ in your decision tree? We need to make sure the question is scoped appropriately so we can provide an adequate answer :) — artemis, Dec 12 '19 at 14:36
You can validate your model. You might also want to take a look at this: https://en.wikipedia.org/wiki/Hyperparameter_optimization — E. Zeytinci, Dec 12 '19 at 14:38
If this fixed your problem, please don't forget to mark as correct @rnlxs — artemis, Dec 16 '19 at 20:08

score 3 · Answer 1 · answered Dec 12 '19 at 14:47

It looks like what you need to do is check to make sure your tree is not overfitting. There are two primary ways we can accomplish this using Decision Trees and sklearn.

Validation Curves

First, you should check whether your tree is actually overfitting. You can do so using a validation curve (see here).

An example of a validation curve is below:

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge

np.random.seed(0)
X, y = load_iris(return_X_y=True)
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X, y = X[indices], y[indices]

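# param_name and param_range are passed by keyword; recent scikit-learn
# versions no longer accept them positionally
train_scores, valid_scores = validation_curve(Ridge(), X, y,
                                              param_name="alpha",
                                              param_range=np.logspace(-7, 3, 3),
                                              cv=5)
print(train_scores)
print(valid_scores)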
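The Ridge example above only shows the API. A minimal sketch of the same idea applied to a decision tree, sweeping `max_depth`, might look like the following. The data here is a synthetic stand-in generated with `make_classification` (an assumption, not OP's dataset), and the depth range is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for OP's data (assumption): 1025 rows, 13 features, binary label
X, y = make_classification(n_samples=1025, n_features=13,
                           n_informative=5, random_state=0)

depths = np.arange(1, 11)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

# Average over the 5 folds: one row per depth, one column per fold
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
for d, tr, va in zip(depths, train_mean, valid_mean):
    print(f"max_depth={d:2d}  train={tr:.3f}  valid={va:.3f}")
```

If the training score keeps climbing with depth while the validation score flattens or drops, the deeper trees are likely overfitting.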

Once you have verified that your tree is overfitting, you need to prune it, which you can accomplish using hyperparameter optimization as mentioned by @e-zeytinci. You can do that with GridSearchCV.

GridSearchCV

GridSearchCV allows us to optimize the hyperparameters of a decision tree, or any model, such as maximum depth and maximum number of nodes (which seem to be OP's concerns), and also helps us accomplish proper pruning.

An example of that implementation can be read here

An example set of working code taken from this post is below:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

def dtree_grid_search(X, y, nfolds):
    # create a dictionary of all values we want to test
    param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15)}
    # decision tree model
    dtree_model = DecisionTreeClassifier()
    # use grid search with nfolds-fold cross-validation to test all values
    dtree_gscv = GridSearchCV(dtree_model, param_grid, cv=nfolds)
    # fit model to data
    dtree_gscv.fit(X, y)
    # return the best-scoring parameter combination
    return dtree_gscv.best_params_
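As a rough sketch of how this could target tree size directly, the grid below also searches `max_leaf_nodes` and `ccp_alpha` (cost-complexity pruning, available in scikit-learn 0.22 and later). The synthetic data and the candidate values are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for OP's data (assumption)
X, y = make_classification(n_samples=1025, n_features=13,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Illustrative candidate values; None means "no limit"
param_grid = {
    "max_depth": [3, 4, 5, 6, 8, None],
    "max_leaf_nodes": [8, 16, 32, None],
    "ccp_alpha": [0.0, 0.001, 0.005, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)
print(search.best_params_)
print("nodes:", best_tree.tree_.node_count, "depth:", best_tree.get_depth())
print("test accuracy:", test_accuracy)
```

Printing the node count and depth of the selected tree makes it easy to compare against the 95 nodes and 8 levels mentioned in the question.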

Random Forests

Alternatively, Random Forests can help with Decision Tree overfitting.

You could implement a RandomForestClassifier and follow the same hyperparameter tuning outlined above.

An example from this post is below:

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)


rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=50, oob_score=True)

# 'auto' is no longer an accepted max_features value in recent scikit-learn
# versions (for classifiers it meant the same as 'sqrt')
param_grid = {
    'n_estimators': [200, 700],
    'max_features': ['sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(X, y)
print(CV_rfc.best_params_)
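To get a feel for whether a forest actually helps on a given dataset, one option is to compare cross-validated accuracy of a single tree and a forest. This is only a sketch on synthetic data (an assumption). The estimator settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification task (assumption), similar to the one above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5)
          for name, model in models.items()}
for name, s in scores.items():
    print(f"{name}: mean accuracy {s.mean():.3f} (std {s.std():.3f})")
```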

score 0 · Answer 2 · edited Dec 14 '19 at 16:25

You can validate your decision tree by comparing its score on the training data with its score on the test data (you already have the test score):

print(confusion_matrix(y_train, classifier.predict(X_train)))
print(classification_report(y_train, classifier.predict(X_train)))

If the results are similar to your test results, your tree fits well in terms of accuracy. You can also check this out for over- and underfitting.
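A compact way to make that comparison is to look at the two accuracies side by side. The sketch below uses a synthetic stand-in for the question's data (an assumption) and reuses the question's variable name `classifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the question's data (an assumption, not OP's dataset)
X, y = make_classification(n_samples=1025, n_features=13,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# score() returns mean accuracy for classifiers
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {train_acc - test_acc:.3f}")
```

A large gap between the two suggests overfitting. A small gap with a high test score, as in the question, suggests the tree generalizes reasonably well on this split.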

On the concept of over- and underfitting:

The blue curve is the error on the training data, whereas the red curve is the test error. Here you can see that the blue error keeps going down while the red one stays stuck. This is overfitting, which means that the model is fitted too closely to the training data.

But your error for your test data is already low, which gives an indication that:

A function that is overfitted is likely to request more information about each item in the validation dataset than does the optimal function; gathering this additional unneeded data can be expensive or error-prone, especially if each individual piece of information must be gathered by human observation and manual data-entry.

Always keep in mind that you only have 14 criteria available. You can find the full list of parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

If you get such an accurate result on balanced data, I would ask whether there is a feature (column) that directly influences your target variable. The keyword is data leakage: a feature that only exists because of your target variable and that you would not have in advance in a real test. One hint to get an idea would be: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
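For a single tree, a quick first look is the fitted model's `feature_importances_`. The sketch below uses synthetic data and hypothetical feature names (both assumptions). With a real DataFrame you would use its column names instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for OP's data (assumption)
X, y = make_classification(n_samples=1025, n_features=13,
                           n_informative=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

classifier = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = classifier.feature_importances_
order = np.argsort(importances)[::-1]  # most important first
for i in order:
    print(f"{feature_names[i]:>12}: {importances[i]:.3f}")
```

One feature dominating the list is not proof of leakage, but it is a good reason to check where that column comes from.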

If you still have the feeling your tree is too deep, you can limit its maximum depth with:

classifier = DecisionTreeClassifier(max_depth=4)
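To see what the limit does to the tree's size, one could compare depth and node count with and without it. This is a sketch on synthetic data (an assumption). `export_text` is used here only to print the top levels of the smaller tree in a readable form.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for OP's data (assumption)
X, y = make_classification(n_samples=1025, n_features=13,
                           n_informative=5, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
small_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

for name, model in [("unrestricted", full_tree), ("max_depth=4", small_tree)]:
    print(f"{name}: depth={model.get_depth()}, nodes={model.tree_.node_count}")

# Text rendering of the first levels of the restricted tree
print(export_text(small_tree, max_depth=2))
```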

You more than likely received a downvote since your answer is primarily links -- it would be more beneficial to OP to post code; examples OP can work from to help with their problem. — artemis, Dec 12 '19 at 14:52
you already have a test score, which is pretty good, the question you need to ask yourself, how good is your train score? — PV8, Dec 12 '19 at 14:52
for me this is more a concept question and not related to coding — PV8, Dec 12 '19 at 14:58
