
I have a dataset with 9 features, x1 to x9. The target variable is Target (it is a classification problem). The code:

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature scaling (columns 2 and 5 are excluded)
training_set[-c(2,5)] = scale(training_set[-c(2,5)])
test_set[-c(2,5)] = scale(test_set[-c(2,5)])


# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Target ~ .,
                   data = training_set)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-2], type = 'class')

# Making the Confusion Matrix
cm = table(test_set[, 2], y_pred)

plot(classifier, uniform=TRUE,margin=0.2)
text(classifier)

produces:

[plot of the fitted decision tree]

Anyway, I see 7 variables sorted by importance. The first question is: why only 7, when there are 9?

summary(classifier)


Variable importance
x7 x6 x4 x1 x3 x2 x5 
27 18 17 14 11  9  4
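
For reference, `printcp` reports the variables actually used in the tree construction, which lets me cross-check the plot against the importance list (same fitted `classifier` as above):

# lists "Variables actually used in tree construction", which can be
# fewer than the variables that receive an importance score
printcp(classifier)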

Moreover (and this is the second question), x3 is missing from the plot. Why?

The dataset is too large to post here, but I wanted to know whether something similar has ever happened to you and whether you have found a possible explanation.

Thank you!

Mark
  • It will be difficult to give a specific answer without more details about the structure of `dataset`: factors, continuous variables, ... – Waldi May 17 '21 at 08:16

1 Answer


This is due to the tree-building process of the rpart algorithm. See here for an in-depth explanation with some real case-study examples. In short, the tree is built by the following process: first, the single variable is found which "best" splits the data into two groups. The data is separated, and this process is then applied recursively to each sub-group until the subgroups either reach a minimum size or no further improvement can be made. This means that some variables can be excluded from the final model entirely.
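
You can see which variables actually made it into the tree on the fitted object itself (a minimal sketch, assuming the `classifier` fitted in the question). The split variable at each node is stored in `frame$var`, while the importance vector also credits surrogate splits:

# variables that appear as primary splits in the final tree
# ("<leaf>" rows are terminal nodes, so they are dropped)
used <- setdiff(unique(as.character(classifier$frame$var)), "<leaf>")
used

# importance sums the goodness of each primary split plus a weighted
# contribution from surrogate splits, so it can rank variables that
# never show up as a node in the plot
classifier$variable.importance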

Furthermore, from the rpart documentation:

The cp option of the summary function instructs it to prune the printout, but it does not prune the tree. For each node up to 5 surrogate splits (default) will be printed, but only those whose utility is greater than the baseline “go with the majority” surrogate.

I think this could explain the missing x3 predictor.
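
One way to check this (a sketch, refitting on the same `training_set` from the question): disable surrogate splits with `rpart.control(maxsurrogate = 0)`. With no missing values the tree structure is unchanged, but the importance vector then reflects primary splits only, so if x3 drops out, its importance came entirely from surrogate splits, and those never appear as nodes in the plot:

library(rpart)
# refit with surrogate splits disabled; variable.importance now
# counts primary splits only (tree is identical if no data are missing)
classifier_nosurr <- rpart(Target ~ ., data = training_set,
                           control = rpart.control(maxsurrogate = 0))
classifier_nosurr$variable.importance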

Elia