I am trying to reimplement RPART so that I can build on it later. So far I have covered only the regression (ANOVA) model. Everything seems clear except one thing: how RPART selects the best split among several predictors with identical improvement.

For example, I have three predictors for the initial split that give identical results (same improvement, same left/right partition, perfect surrogates for each other): X310, X312 and X317. By default RPART selects X312, even though it is not the first of the three in column order. If I permute the columns, RPART selects either X312 or X317, but never X310.

Here is the summary output when it selects X312:

Node number 1: 100 observations,    complexity param=0.7123717
  mean=0.5155042, MSE=0.08350028
  left son=2 (47 obs) right son=3 (53 obs)
  Primary splits:
      X312 < 0.03673   to the left,  improve=0.7123717, (0 missing)
      X317 < 0.0187715 to the left,  improve=0.7123717, (0 missing)
      X310 < 0.0440585 to the left,  improve=0.7123717, (0 missing)
      X318 < 0.0167545 to the left,  improve=0.7123435, (0 missing)
      X323 < 0.0101715 to the left,  improve=0.7092180, (0 missing)

And when it selects X317:

Node number 1: 100 observations,    complexity param=0.7123717
  mean=0.5155042, MSE=0.08350028
  left son=2 (47 obs) right son=3 (53 obs)
  Primary splits:
      X317 < 0.0187715 to the left,  improve=0.7123717, (0 missing)
      X312 < 0.03673   to the left,  improve=0.7123717, (0 missing)
      X310 < 0.0440585 to the left,  improve=0.7123717, (0 missing)
      X318 < 0.0167545 to the left,  improve=0.7123435, (0 missing)
      X323 < 0.0101715 to the left,  improve=0.7092180, (0 missing)

Once again, everything else is identical. I looked through the C code for RPART but could not find any additional tie-breaking checks. I would be very grateful for any ideas.
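One thing that may be worth ruling out is that the improvements only look identical at the seven digits printed by the summary. The sketch below is in Python rather than R, and it rests on two assumptions that are not confirmed by the RPART source: that the best split is chosen by a strict "greater than" scan over the columns, and that the sums behind each improvement are accumulated in a different row order for each predictor. The names `pick_best`, `col_a` and `col_b` are hypothetical and the numbers are not taken from my data.

```python
def pick_best(candidates):
    """Return the name of the first candidate with the strictly largest value.

    Assumption: a strict '>' scan in column order, so exact ties keep
    the earlier column. This is an illustration, not RPART's actual code.
    """
    best_name, best_value = None, float("-inf")
    for name, value in candidates:
        if value > best_value:
            best_name, best_value = name, value
    return best_name


# The same three terms, summed in two different orders. Floating-point
# addition is not associative, so the totals can differ in the last bit.
first = 0.3 + 0.2 + 0.1
second = 0.1 + 0.2 + 0.3

# Both values look the same when formatted to 7 significant digits.
same_when_printed = f"{first:.7g}" == f"{second:.7g}"

# With a hidden last-bit difference, the later column can win.
winner = pick_best([("col_a", first), ("col_b", second)])

# With an exact tie, the earlier column wins.
tie_winner = pick_best([("col_a", 1.0), ("col_b", 1.0)])

print(same_when_printed, first == second, winner, tie_winner)
```

If something like this is happening, printing the `improve` column of the fitted object's `splits` matrix with more digits (for example 17) should show whether the three improvements differ below the displayed precision.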

zx8754