I am puzzled by the following:

set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=TRUE)),
                x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
library(rpart)
fit.default = rpart(outcome ~ x + y + z, data=df, method='class')
fit.specified = rpart(outcome ~ x + y + z, data=df, method='class',
                      parms=list(split='gini',
                                 loss=matrix(c(0,1,1,1,0,1,1,1,0), nrow=3, ncol=3, byrow=TRUE)))
fit.default$cptable
fit.specified$cptable

The xerror and xstd columns differ between the default and the specified fit. But according to ?rpart, the default split is 'gini' and the default loss matrix is all 1s with a zero diagonal, which is exactly what I provided. So why would the two calls behave differently? I noticed this because I was picking a different tree based on the minimum xerror and wanted to verify the baseline default case.

Palace Chan
  • The cross-validation in the first `rpart` call advances the random number generator's state, so the cross-validation in the next call assigns its folds slightly differently. If you run the two calls entirely separately, you get the same values. – joran Aug 04 '15 at 21:23
  • @joran Ohhh I see. This makes sense. – Palace Chan Aug 04 '15 at 21:28

1 Answer

To illustrate my comment above, run the two fits completely disentangled, resetting the seed before each one:

set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=TRUE)),
                x=rnorm(1000), 
                y=rnorm(1000), 
                z=rnorm(1000))
library(rpart)
fit.default = rpart(outcome ~ x + y + z, 
                    data=df, 
                    method='class')
fit.default$cptable  

set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=TRUE)),
                x=rnorm(1000), 
                y=rnorm(1000), 
                z=rnorm(1000))
library(rpart)
fit.specified = rpart(outcome ~ x + y + z, 
                      data=df, 
                      method='class', 
                      parms=list(split='gini', 
                                 loss=matrix(c(0,1,1,1,0,1,1,1,0), 
                                             nrow=3,
                                             ncol=3,
                                             byrow=TRUE)))

fit.specified$cptable

You get:

> fit.default$cptable
         CP nsplit rel error    xerror       xstd
1 0.0375000      0  1.000000 1.0000000 0.02371708
2 0.0140625      1  0.962500 0.9640625 0.02401939
3 0.0100000      3  0.934375 0.9921875 0.02378775

and

> fit.specified$cptable
         CP nsplit rel error    xerror       xstd
1 0.0375000      0  1.000000 1.0000000 0.02371708
2 0.0140625      1  0.962500 0.9640625 0.02401939
3 0.0100000      3  0.934375 0.9921875 0.02378775
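
A more compact way to check the same thing is sketched below: build the data once, then reset the seed immediately before each fit. It assumes the only randomness in `rpart` here comes from the cross-validation fold assignment. The seed value `1` and the names `loss01`, `state.before`, `rng.advanced` and `same.cptable` are arbitrary choices for illustration, not part of the original code.

```r
library(rpart)

# Same data as in the question
set.seed(144)
df <- data.frame(outcome = as.factor(sample(c('a', 'b', 'c'), 1000, replace = TRUE)),
                 x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))

# 0/1 loss matrix: ones everywhere, zeros on the diagonal (hypothetical name)
loss01 <- matrix(1, nrow = 3, ncol = 3) - diag(3)

set.seed(1)                      # arbitrary seed, set right before the first fit
state.before <- .Random.seed     # snapshot of the RNG state
fit.default <- rpart(outcome ~ x + y + z, data = df, method = 'class')

# TRUE if the fit consumed random numbers (i.e. the RNG state moved)
rng.advanced <- !identical(state.before, .Random.seed)

set.seed(1)                      # same seed again, so the folds are assigned identically
fit.specified <- rpart(outcome ~ x + y + z, data = df, method = 'class',
                       parms = list(split = 'gini', loss = loss01))

# Compare the two complexity tables, including xerror and xstd
same.cptable <- isTRUE(all.equal(fit.default$cptable, fit.specified$cptable))

print(rng.advanced)
print(same.cptable)
```

Because the seed before each fit is `1` rather than the state left over after building `df`, the xerror and xstd values will generally not match the tables above; the point is only that the two fits agree with each other.
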
joran