I am trying to reimplement RPART so that I can extend it later. So far I have only done the regression (ANOVA) model. Everything seems pretty clear except one thing: how RPART selects the best split among several predictors with identical improvement.
For example, I have three predictors for the initial split that give identical results (same improvement, same split, perfect surrogates of each other): say X310, X312 and X317. By default RPART selects X312, even though it is not the first of these predictors in column order. If I permute the columns, RPART selects either X312 or X317, but never X310.
Here is an example of the summary output when it selects X312:
Node number 1: 100 observations, complexity param=0.7123717
mean=0.5155042, MSE=0.08350028
left son=2 (47 obs) right son=3 (53 obs)
Primary splits:
X312 < 0.03673 to the left, improve=0.7123717, (0 missing)
X317 < 0.0187715 to the left, improve=0.7123717, (0 missing)
X310 < 0.0440585 to the left, improve=0.7123717, (0 missing)
X318 < 0.0167545 to the left, improve=0.7123435, (0 missing)
X323 < 0.0101715 to the left, improve=0.7092180, (0 missing)
And when it selects X317:
Node number 1: 100 observations, complexity param=0.7123717
mean=0.5155042, MSE=0.08350028
left son=2 (47 obs) right son=3 (53 obs)
Primary splits:
X317 < 0.0187715 to the left, improve=0.7123717, (0 missing)
X312 < 0.03673 to the left, improve=0.7123717, (0 missing)
X310 < 0.0440585 to the left, improve=0.7123717, (0 missing)
X318 < 0.0167545 to the left, improve=0.7123435, (0 missing)
X323 < 0.0101715 to the left, improve=0.7092180, (0 missing)
Once again, everything is identical. I looked through the C code for RPART but could not find any additional tie-breaking checks. I would be very grateful for any ideas.
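To make the question concrete, here is a minimal Python sketch of two things that could matter in a reimplementation. It is not RPART's C code, and the function names `left_sum_in_x_order` and `first_best` are hypothetical. It assumes (unverified against RPART) that the left-hand sum is accumulated in increasing order of the predictor, and that the best split is found with a strict `>` comparison. Under those assumptions, two predictors that induce the same partition can still produce sums that differ in the last bits, so improvements that print identically at 7 digits need not be exactly tied.

```python
def left_sum_in_x_order(x, y, threshold):
    """Sum y over rows with x < threshold, accumulating in increasing-x order.

    Assumption: a running sum over rows sorted by the predictor. This is an
    illustration, not a transcription of RPART's code.
    """
    total = 0.0
    for xi, yi in sorted(zip(x, y)):
        if xi < threshold:
            total += yi
    return total


def first_best(improvements):
    """Index of the largest value; a strict '>' keeps the earliest exact tie."""
    best = 0
    for i in range(1, len(improvements)):
        if improvements[i] > improvements[best]:
            best = i
    return best


# Toy data: x_a and x_b split the rows into the same two groups at 5.0,
# but order the rows inside the left group differently.
y = [0.1, 0.2, 0.3, 5.0]
x_a = [1.0, 2.0, 3.0, 9.0]  # left group summed as 0.1 + 0.2 + 0.3
x_b = [3.0, 2.0, 1.0, 9.0]  # left group summed as 0.3 + 0.2 + 0.1

sum_a = left_sum_in_x_order(x_a, y, 5.0)
sum_b = left_sum_in_x_order(x_b, y, 5.0)

# Print with full precision; 7 significant digits would hide the difference.
print(repr(sum_a), repr(sum_b), sum_a == sum_b)

# With exact ties, the strict comparison returns the first candidate.
print(first_best([0.5, 0.5, 0.5]))
```

If something like this is happening, printing the improvements with about 17 significant digits rather than the 7 shown in the summary would reveal whether the three predictors are really tied or differ only at rounding level.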