
I'm trying to build a decision tree, and found the following code online.

My question is:

  • What does clf.score(X_train, Y_train) evaluate in a decision tree? The output is in the following screenshot; I'm wondering what that value represents.

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, Y_train)
    print("Training: " + str(clf.score(X_train, Y_train)))
    print("Test: " + str(clf.score(X_test, Y_test)))
    pred = clf.predict(X_train)
    

    Output: (screenshot showing the training and test accuracy values)

  • And the following code, I think, calculates several scores for the model. The higher the max_depth I set, the higher the scores get; that part is easy to understand. However, I'm wondering what the difference is between these numbers and the Training and Test values in the previous screenshot.

(screenshot of the model's scores)

  • My goal is to predict whether a house price is over 20k or not. Which score should I consider when choosing the best-fit, simplest model?
desertnaut
Cloudy_Green
  • By default, clf.score uses the mean accuracy (your accuracy score). Which metric to use will depend on the balance of the dataset and your tolerance for false positives and false negatives; there is not only one answer. – Frayal May 03 '19 at 13:58
  • Thank you @Alexis for your reply! That makes sense. I've got another question: when I set max_depth to 5, Training is 0.89 and Test is 0.90; with 6, it's about 0.899 and 0.91; with 7, about 0.88 and 0.89... I found the score moves around 0.9 after 5. Can I choose that max_depth as the best-fit model to predict the final price? – Cloudy_Green May 03 '19 at 14:01
  • 1
    yes you should chose the one giving you the best score on the test set. it is a hyper parameter optimisation. But be sure to cross validate to avoid the overfitting. I assume it the boston house dataset? if so look at kaggle solutions, some of them are great – Frayal May 03 '19 at 14:06
  • 1
    Thank you sooo much @Alexis. Yeah, this is a Kaggle data but not Boston one - something similar. I will find some other good kernel for some inspirations! Thank you! You have a great one! :) – Cloudy_Green May 03 '19 at 14:14
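The cross-validation suggested in the comments can be sketched as follows. This uses a synthetic dataset as a stand-in for the asker's Kaggle data, and the depth range is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Evaluate each candidate max_depth with 5-fold cross-validation and
# keep the depth with the best mean accuracy across folds.
results = {}
for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)  # mean accuracy per fold
    results[depth] = scores.mean()

best_depth = max(results, key=results.get)
print("best max_depth:", best_depth)
print("cross-validated accuracy:", results[best_depth])
```

Selecting the depth on cross-validated scores rather than on a single train/test split helps avoid picking a value that merely overfits one particular split.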

1 Answer


As correctly pointed out in the comments, it is indeed the mean training accuracy; you could have guessed that already by simply comparing the four scores in your 2nd screenshot with the training score in your 1st. In any case, before opening such questions here, you should first consult the relevant documentation, which is arguably your best friend in cases like this. Quoting from the score method in the scikit-learn DecisionTreeClassifier docs:

score (X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.
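A minimal sketch (on a synthetic dataset, not the asker's data) confirming that clf.score is just accuracy_score applied to the model's predictions:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data for illustration.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The two values below are identical by definition: score() computes
# the mean accuracy of predict() against the true labels.
print(clf.score(X_test, y_test))
print(accuracy_score(y_test, clf.predict(X_test)))
```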

desertnaut