cross validation + decision trees in sklearn

Question

I am attempting to create a decision tree with cross-validation using sklearn and pandas.

My question concerns the code below: it splits the data once into a training set and a test set, which I then use for fitting and scoring. I want to find the best depth of the tree by rebuilding it n times with different max_depth values. Should I be using k-fold cross-validation instead, and if so, how would I use it within the code I have?

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)

df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.4,random_state=0)

depth = []
for i in range(3,20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train,y_train)
    depth.append((i,clf.score(x_test,y_test)))
print(depth)

Here is a link to the data I am using, in case that helps anyone: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

Dimosthenis · Accepted Answer · 2018-03-18T16:49:54.720

In your code you are creating a single, static train/test split. If you want to select the best depth by cross-validation, you can use sklearn.model_selection.cross_val_score inside the for loop.

You can read sklearn's documentation for more information.

Here is an update of your code with CV:

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import cross_val_score
from pprint import pprint

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']

# x_train,x_test,y_train,y_test = cross_validation.train_test_split(x,y,test_size=0.4,random_state=0)
depth = []
for i in range(3,20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross validation 
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i,scores.mean()))
print(depth)
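
Passing an integer such as cv=7 to a classifier makes cross_val_score use stratified k-fold splits. As a minimal sketch of making that splitter explicit and then picking the best depth from the collected scores, the snippet below uses StratifiedKFold. It assumes synthetic data from make_classification as a stand-in for magic04.data, and the names depth_scores, best_depth and best_score are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the x / y built from magic04.data.
x, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Explicit 7-fold stratified splitter; shuffling with a fixed seed
# makes the folds reproducible between runs.
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)

depth_scores = []
for max_depth in range(3, 20):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    scores = cross_val_score(clf, x, y, cv=cv)
    depth_scores.append((max_depth, scores.mean()))

# Pick the depth with the highest mean cross-validated accuracy.
best_depth, best_score = max(depth_scores, key=lambda pair: pair[1])
print(best_depth, best_score)
```

Reusing the same cv object for every depth means each candidate tree is scored on identical folds, so the mean scores are directly comparable.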

Alternatively, you can use sklearn.model_selection.GridSearchCV instead of writing the for loop yourself, which is especially convenient if you want to optimize more than one hyper-parameter.

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']


parameters = {'max_depth':range(3,20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_
print (clf.best_score_, clf.best_params_)

Edit: changed how GridSearchCV is imported to accommodate learn2day's comment.
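
To illustrate the point about optimizing more than one hyper-parameter, here is a rough sketch of a grid over three DecisionTreeClassifier parameters. The grid values are hypothetical choices for illustration, not tuned recommendations, and the synthetic data from make_classification is an assumption standing in for magic04.data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the x / y built from magic04.data.
x, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical grid: every combination of these values is cross-validated.
param_grid = {
    "max_depth": list(range(3, 20, 4)),   # 3, 7, 11, 15, 19
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(x, y)

# Best combination and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)
```

The number of fits grows with the product of the grid sizes (here 5 x 3 x 2 = 30 candidates, each fitted once per fold), so it can pay to keep each list short.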

+1 for answering the question asked and also suggesting grid search, which is definitely the better practice for this type of problem — dsal1951, Aug 31 '16 at 01:29
`grid_search` is deprecated since 0.18, and removed since 0.20. You should now use `GridSearchCV` from `sklearn.model_selection` — learn2day, Mar 29 '17 at 01:44
@Dimosthenis How will the model be validated on a _test_ dataset since all the data is used in training the model ? — Rookie_123, Nov 26 '18 at 05:51
Or shall we keep a part of dataset as a _test_ dataset and should not use that even for cross validation — Rookie_123, Nov 26 '18 at 05:59
@Rookie_123 If you choose to use cross validation to optimize the model's hyper parameters then it's better to do a train/test split first, train and do cross validation on the training set, and test at the end on the first test set you created. `sklearn.model_selection.train_test_split` is handy for the train test split — Dimosthenis, Nov 28 '18 at 12:47
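
The workflow described in the last comment (split first, cross-validate on the training set only, score once on the held-out test set) could look roughly like the sketch below. It assumes synthetic data from make_classification in place of magic04.data, and the 60/40 split simply mirrors the test_size=0.4 used in the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the x / y built from magic04.data.
x, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Hold out a test set that is never seen during tuning.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0, stratify=y
)

# 2. Cross-validate over max_depth using the training set only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": list(range(3, 20))},
    cv=5,
)
search.fit(x_train, y_train)

# 3. Score the refitted best model once on the held-out test set.
test_accuracy = search.score(x_test, y_test)
print(search.best_params_, test_accuracy)
```

With the default refit=True, GridSearchCV retrains the best configuration on the whole training set, so search.score evaluates that final model rather than any single fold's model.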
