
I'm fairly new to decision trees and have some trouble interpreting them as I move down the branches. I have a few questions about a plot I made in R. The response variable is Survived (Yes/No), which is to be predicted from age, fare, number of siblings, and number of parents. I have attached the decision tree below; it was built on Kaggle's Titanic dataset.

  1. What do the different colors of green/blue mean?
  2. How do I interpret the leaf nodes?
  3. I understand that the very top node reads as 38% survived, 62% did not survive, and 100% of the population is in that bucket. If I move to the right, how would I interpret Bucket #3? And if I kept going, Bucket #6, and so on?

[Figure: Titanic decision tree]

Sandipan Dey
Gene Nguyen

1 Answer


1) Each node is colored according to its majority class. Nodes whose majority class label is no (did not survive) are colored green; nodes whose majority class label is yes (survived) are colored blue.

2) Let's interpret the leftmost leaf node at the bottom. 83% of the data points in this node have the class label no and 17% have the class label yes. The node contains 62% of the data points in the entire dataset.

3) Bucket #3 can be interpreted the same way: 26% of the data points in this node have the class label no and 74% have the class label yes. The node contains 35% of the data points in the entire dataset. If you compute the weighted proportion of the no labels across nodes #2 and #3 (node #2 holds 65% of the data with 81% no, node #3 holds 35% with 26% no), you get 0.65*0.81 + 0.35*0.26 = 0.6175 ≈ 0.62, which is the proportion of data points in the root node that have the label no.
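The weighted-average check in point 3 can be reproduced with a few lines of arithmetic. This is only a sketch in Python (the plot itself was made in R); the dictionary and the function name `parent_proportion` are made up for illustration, and the node figures are the ones quoted in this answer, not values recomputed from the data.

```python
# Node statistics as quoted in the answer:
# (share of the whole dataset in the node, proportion of that node labelled "no").
children = {
    "node_2": (0.65, 0.81),
    "node_3": (0.35, 0.26),
}

def parent_proportion(children):
    """Weighted average of the children's class proportions.

    Each child's proportion is weighted by the share of data it holds,
    which recovers the class proportion of the parent node.
    """
    total_weight = sum(weight for weight, _ in children.values())
    weighted_sum = sum(weight * prop for weight, prop in children.values())
    return weighted_sum / total_weight

root_no = parent_proportion(children)
print(f"Proportion of 'no' in the root: {root_no:.2f}")
```

The same check should hold at every split: the children's shares of the data add up to the parent's share, and the weighted average of their class proportions gives back the parent's proportion, up to the rounding shown in the plot.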

Sandipan Dey
    Really helpful, thank you. Would this interpretation of Bucket #3 be valid? 1) 74% survived if they were female (within 35% of the data) 2) 26% did not survive if they were female (within 35% of the data) – Gene Nguyen Feb 09 '17 at 19:10