
I'm trying to build a decision tree, and found the following code online.

My question is:

  • What does clf.score(X_train, Y_train) evaluate in a decision tree? The output is in the following screenshot; I'm wondering what that value represents.

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, Y_train)
    print("Training: " + str(clf.score(X_train, Y_train)))
    print("Test: " + str(clf.score(X_test, Y_test)))
    pred = clf.predict(X_train)
    

    Output: (screenshot showing the training and test accuracy values)

  • And the following code, I think, calculates several scores for the model. The higher the max_depth I set, the higher the scores get; that part is easy to understand. However, I'm wondering what the difference is between these numbers and the Training and Test values in the previous screenshot.

(screenshot of the model's scores)

  • My goal is to predict whether a house price is over 20k or not. Which score should I consider when choosing the best-fit, simplest model?
desertnaut
Cloudy_Green
  • By default, clf.score uses the mean accuracy (your accuracy score). Which metric to use will depend on the balance of the dataset and your tolerance for false positives and false negatives; there is not only one answer. – Frayal May 03 '19 at 13:58
  • Thank you @Alexis for your reply! That makes sense. I've got another question: when I set max_depth to 5, Training is 0.89 and Test is 0.90; with 6, it's about 0.899 and 0.91; with 7, about 0.88 and 0.89... I found the score moves around 0.9 after 5. Can I choose that max_depth as the best-fit model to predict the final price? – Cloudy_Green May 03 '19 at 14:01
  • 1
    yes you should chose the one giving you the best score on the test set. it is a hyper parameter optimisation. But be sure to cross validate to avoid the overfitting. I assume it the boston house dataset? if so look at kaggle solutions, some of them are great – Frayal May 03 '19 at 14:06
  • 1
    Thank you sooo much @Alexis. Yeah, this is a Kaggle data but not Boston one - something similar. I will find some other good kernel for some inspirations! Thank you! You have a great one! :) – Cloudy_Green May 03 '19 at 14:14
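The cross-validation suggested in the comments can be sketched as follows. This uses a synthetic dataset as a stand-in for the asker's Kaggle data, and the depth range is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Evaluate each candidate max_depth with 5-fold cross-validation and
# keep the depth with the best mean accuracy across folds.
results = {}
for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)  # mean accuracy per fold
    results[depth] = scores.mean()

best_depth = max(results, key=results.get)
print("best max_depth:", best_depth)
print("cross-validated accuracy:", results[best_depth])
```

Selecting the depth on cross-validated scores rather than on a single train/test split helps avoid picking a value that merely overfits one particular split.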

1 Answer


As correctly pointed out in the comments, it is indeed the mean training accuracy; you could have guessed that already by simply comparing the four scores in your 2nd screenshot with the training score in your 1st. In any case, before opening such questions here, you should first consult the relevant documentation, which is arguably your best friend in cases like this. Quoting from the score method in the scikit-learn DecisionTreeClassifier docs:

score (X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.
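A minimal sketch (on a synthetic dataset, not the asker's data) confirming that clf.score is just accuracy_score applied to the model's predictions:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data for illustration.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The two values below are identical by definition: score() computes
# the mean accuracy of predict() against the true labels.
print(clf.score(X_test, y_test))
print(accuracy_score(y_test, clf.predict(X_test)))
```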

desertnaut