
When I fit a classification tree with Survived ~ Sex + Pclass, the tree splits only on Sex and ignores Pclass (with Survived, Sex, and Pclass converted to factors as shown in the code), no matter which control parameters I specify.

Code:

library(titanic)
library(rpart)
library(rpart.plot)

# Convert the outcome and predictors to factors
titanic_train$Survived <- factor(titanic_train$Survived)
titanic_train$Sex <- factor(titanic_train$Sex)
titanic_train$Pclass <- factor(titanic_train$Pclass)

# Loosened stopping rules; the tree still splits only on Sex
ctrl <- rpart.control(minsplit = 6, cp = 0.001)
fit <- rpart(Survived ~ Pclass + Sex, data = titanic_train, control = ctrl)
rpart.plot(fit)

Plot of the fitted tree (a single split, on Sex): https://i.stack.imgur.com/V50YE.png

  • It is not required that the optimal classification tree use all of the variables in your model; if you just inserted random noise, you should be happy it is not a factor. – rawr Mar 04 '21 at 18:57

3 Answers


It really, really doesn't want to split any further. Even setting cp = 0 doesn't do the trick (with minsplit = 1), but cp = -1 does, making the tree branch down to a leaf for each class; a sketch is below. (Whether that's desirable or not is another story...)

[rpart.plot of the fully grown tree]
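A minimal sketch of that negative-cp setting, assuming the factored titanic_train from the question:

# A negative cp accepts splits even when they do not reduce the
# misclassification-based cost, so the tree grows to pure leaves
ctrl <- rpart.control(minsplit = 1, cp = -1)
fit <- rpart(Survived ~ Pclass + Sex, data = titanic_train, control = ctrl)
rpart.plot(fit)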

Ben Reiniger

This is indeed an interesting observation since

  • we know that Pclass is a highly informative variable,
  • most other classification tree software will split further on Pclass (e.g. tree::tree, partykit::ctree, sklearn.tree.DecisionTreeClassifier, ...; see the ctree sketch below),
  • the regression tree version of the exact same code (i.e. NOT converting Survived to a factor but keeping it numeric) results in 4 leaves, even though the Gini impurity is identical to the variance loss function for 0/1 data.

It is also difficult to explain why, for cp = 0 and minsplit = 1, the resulting tree is not the deepest possible.
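For illustration, a minimal ctree comparison (a sketch, assuming the factored titanic_train from the question):

library(partykit)
# ctree selects splits via conditional-inference tests rather than
# rpart's misclassification-based criterion, and it splits on Pclass here
ct <- ctree(Survived ~ Pclass + Sex, data = titanic_train)
plot(ct)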

Markus Loecher

The rpart author allowed me to use his answer, which I paste below:

train <- titanic_train
names(train) <- tolower(names(train))  # I'm lazy

train$pclass <- factor(train$pclass)

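# survived stays numeric here, so fit1 defaults to the anova (regression)
# method; fit2 explicitly requests the classification method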
fit1 <- rpart(survived ~ pclass + sex, data=train)
fit2 <- rpart(survived ~ pclass + sex, data=train, method="class")
fit1

n= 891 

node), split, n, deviance, yval
      * denotes terminal node

1) root 891 210.727300 0.3838384  
  2) sex=male 577  88.409010 0.1889081  
    4) pclass=2,3 455  54.997800 0.1406593 *
    5) pclass=1 122  28.401640 0.3688525 *
  3) sex=female 314  60.105100 0.7420382  
    6) pclass=3 144  36.000000 0.5000000 *
    7) pclass=1,2 170   8.523529 0.9470588 *

fit2
n= 891 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 891 342 0 (0.6161616 0.3838384)  
  2) sex=male 577 109 0 (0.8110919 0.1889081) *
  3) sex=female 314  81 1 (0.2579618 0.7420382) *

The issue: when you choose "classification" as the method, either explicitly as I did above or implicitly by setting the outcome to a factor, you have declared that the loss function is a simple "correct/incorrect" for alive/dead. For males, the survival rate is .189, which is < .5, so they are classed as 0. The next split below gives rates of .14 and .37, both of which are < .5, so both are also treated as 0. The second split did not improve the model, according to the criterion that you chose: with or without it, all males are a "0", so there is no need for the second split.
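That error-count arithmetic can be checked directly against the fit1 printout above:

# Males without the pclass split: 577 cases, all predicted 0
round(577 * 0.1889081)                           # 109 misclassified
# With the split, both child nodes still predict 0
round(455 * 0.1406593) + round(122 * 0.3688525)  # 64 + 45 = 109, no gain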

Ditto for the females: the overall rate and the two subclass rates are all >= .5, so the second split does not improve the prediction, according to the criterion that you selected.

When I leave the response as continuous, the final criterion is MSE, and the further splits are counted as an improvement.
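A minimal sketch of that continuous-response route, reusing the train data frame from this answer (the 0.5 threshold step is my illustration, not part of the author's reply):

# Numeric 0/1 response -> anova method -> the deeper four-leaf tree
fit_num <- rpart(survived ~ pclass + sex, data = train)
# Post-hoc classification by thresholding the fitted survival rates
pred <- as.integer(predict(fit_num) >= 0.5)
table(predicted = pred, actual = train$survived)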

Markus Loecher