I am applying a decision tree with K-fold cross-validation using sklearn. Can someone help me show its average score? Below is my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,classification_report

dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")

X=dta.drop("whether he/she donated blood in March 2007",axis=1)

X=X.values # convert dataframe to numpy array

y=dta["whether he/she donated blood in March 2007"]

y=y.values # convert dataframe to numpy array

kf = KFold(n_splits=10)  # defaults: shuffle=False, random_state=None

clf_tree=DecisionTreeClassifier()

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    print("classification_report_tree", 
           classification_report(y_test,clf_tree.predict(X_test)))
Vivek Kumar
  • What do you mean by `average score`? Do you only want the accuracy, or the recall, precision and f1 as well (since you are printing the classification report)? – Vivek Kumar Nov 13 '17 at 06:35
  • I want to run a decision tree with K-fold and show the overall accuracy. With k = 10 it runs 10 times and gives 10 accuracies, one per run. How do I show the overall accuracy? – Ngọc Vũ Đình Nov 13 '17 at 07:48

2 Answers


If you only want accuracy, you can simply use cross_val_score():

from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=10)
clf_tree = DecisionTreeClassifier()
scores = cross_val_score(clf_tree, X, y, cv=kf)  # one accuracy per fold

avg_score = np.mean(scores)
print(avg_score)

Here cross_val_score takes your original X and y (not split into train and test). It splits them into train and test sets according to cv, fits the model on each training set, and scores it on the corresponding test set. For a classifier the default score is accuracy, and the per-fold scores are returned in the scores variable.

So with 10 folds, scores holds 10 values, and you can simply take their mean.
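
To make the equivalence concrete, here is a minimal sketch comparing cross_val_score with a hand-written fold loop. It assumes synthetic data from make_classification in place of the transfusion dataset, and it fixes random_state on the tree (not part of the original code) only so that both approaches build identical trees.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the transfusion dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

kf = KFold(n_splits=10)
# random_state is fixed only so both approaches build identical trees.
clf_tree = DecisionTreeClassifier(random_state=0)

# One call: split, fit on each training fold, score on each test fold.
scores = cross_val_score(clf_tree, X, y, cv=kf)

# The same procedure written out by hand.
manual_scores = []
for train_index, test_index in kf.split(X):
    model = clone(clf_tree)  # fresh, unfitted copy for every fold
    model.fit(X[train_index], y[train_index])
    y_pred = model.predict(X[test_index])
    manual_scores.append(accuracy_score(y[test_index], y_pred))

# Both averages are the mean accuracy over the 10 folds.
print(scores.mean(), np.mean(manual_scores))
```

The two printed averages should agree, because the default scorer for a classifier is its accuracy on the test fold.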

Vivek Kumar

You can use the precision_recall_fscore_support metric from sklearn and then average the results of each fold per class. I am assuming here that you need the scores averaged per class.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV,cross_val_score

dta=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")

X=dta.drop("whether he/she donated blood in March 2007",axis=1)

X=X.values # convert dataframe to numpy array

y=dta["whether he/she donated blood in March 2007"]

y=y.values # convert dataframe to numpy array

kf = KFold(n_splits=10)  # defaults: shuffle=False, random_state=None

clf_tree=DecisionTreeClassifier()

score_array =[]
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf=clf_tree.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    score_array.append(precision_recall_fscore_support(y_test, y_pred, average=None))

avg_score = np.mean(score_array,axis=0)
print(avg_score)

#Output:
#[[  0.77302466   0.30042282]
# [  0.81755068   0.22192344]
# [  0.79063779   0.24414489]
# [ 57.          17.8       ]]

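avg_score has one row per metric and one column per class. To get the precision of class 0, use avg_score[0][0]. Recall is in the second row (for class 0, avg_score[1][0]), and the F-score and support are in the third and fourth rows respectively. The support row is the number of true samples of each class in a test fold, averaged over the folds.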
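
The row layout can be checked with a small sketch. It assumes synthetic data from make_classification rather than the transfusion dataset, and it passes labels=[0, 1] (not in the code above) so that every fold reports both classes even if one is missing from a test fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the transfusion dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

kf = KFold(n_splits=10)
score_array = []
for train_index, test_index in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_index], y[train_index])
    y_pred = clf.predict(X[test_index])
    # labels=[0, 1] keeps the result at two columns for every fold.
    score_array.append(
        precision_recall_fscore_support(
            y[test_index], y_pred, labels=[0, 1], average=None
        )
    )

# Stacked shape is (10 folds, 4 metrics, 2 classes); average over the folds.
avg_score = np.mean(score_array, axis=0)

# Unpack the four rows by name instead of indexing with avg_score[i][j].
precision, recall, fscore, support = avg_score
for label in (0, 1):
    print(label, precision[label], recall[label], fscore[label], support[label])
```

Unpacking the rows into named arrays avoids having to remember which row index holds which metric.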

Gambit1614
  • While the other answer is technically correct, this answer also shows how to actually train the model! :) – jcr Nov 12 '18 at 12:34