How do I incorporate weights into the minsplit criterion in rpart when the weights are uneven? I could not find a way for the minsplit threshold to take the weights into account, and when the weights are uneven this becomes a problem, as the following example shows. My current workaround is to expand the data so that each row is a single observation, but that seems wasteful in both time and memory (and I doubt I can keep the real datasets I need to work with in memory in their expanded form anyway), hence this request for help. Thanks in advance, -Saar

The following code shows what the issue is; the first 3 trees are the same, but the following two (with uneven weights) turn out differently:

## playing with rpart weights
require(rpart)
dev.new()
par(mfrow=c(2,3), xpd=NA) 
data(kyphosis)

fitOriginal <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, control=rpart.control(minsplit=15))
plot(fitOriginal)
text(fitOriginal, use.n=TRUE)

# this dataset is the original data repeated 3 times
kyphosisRepeated <- rbind(kyphosis, kyphosis, kyphosis)
fitRepeated <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisRepeated, control=rpart.control(minsplit=45))
plot(fitRepeated)
text(fitRepeated, use.n=TRUE)

# instead of repeating, use weights
kyphosisWeighted <- kyphosis
kyphosisWeighted$myWeights <- 3
fitWeighted <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisWeighted, weights=myWeights, 
    control=rpart.control(minsplit=15))        ## minsplit has to be adjusted for weights...
plot(fitWeighted)
text(fitWeighted, use.n=TRUE)

# uneven weights don't work the same way
kyphosisUnevenWeights <- rbind(kyphosis, kyphosis)
kyphosisUnevenWeights$myWeights <- c(rep(1,length.out=nrow(kyphosis)), rep(2,length.out=nrow(kyphosis)))

fitUneven15 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
    control=rpart.control(minsplit=15))
plot(fitUneven15)
text(fitUneven15, use.n=TRUE)

fitUneven45 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
    control=rpart.control(minsplit=45))
plot(fitUneven45)
text(fitUneven45, use.n=TRUE)

## 30 works, but seems like a special case 
fitUneven30 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
    control=rpart.control(minsplit=30))
plot(fitUneven30)
text(fitUneven30, use.n=TRUE)
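
For reference, the expansion workaround described in the question can be sketched as follows. This is only an illustration on toy data: the data frame `weighted` and its weight column `w` are hypothetical names, and the sketch assumes the weights are whole numbers.

```r
# Hypothetical toy data: `w` is an assumed column of whole-number weights.
weighted <- data.frame(x = c(10, 20, 30), w = c(1, 2, 3))

# Repeat each row index w times, then drop the now-redundant weight column.
idx <- rep(seq_len(nrow(weighted)), times = weighted$w)
expanded <- weighted[idx, "x", drop = FALSE]

nrow(expanded)  # equals sum(weighted$w)
```

The expanded frame has as many rows as the sum of the weights, which is why this approach becomes costly in memory for large, heavily weighted datasets.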
Saar

1 Answer

There is no issue here. If you use a dataset twice as large as the original and then require minsplit to be three times as large as your original minsplit, of course you're going to grow a shorter tree (assuming the relativities amongst the weights remain the same). See these revised examples, which show that you will grow identical trees if you keep the weight relativities the same and the ratio of minsplit/n the same too.

## playing with rpart weights
require(rpart)
dev.new()
par(mfrow=c(2,2), xpd=NA) 
data(kyphosis)

# this dataset is the original data repeated 2 times############################################################
# without weights
kyphosisRepeated <- rbind(kyphosis, kyphosis)
fitRepeated <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisRepeated, control=rpart.control(minsplit=30))
plot(fitRepeated)
text(fitRepeated, use.n=TRUE)

# with weights
kyphosisUnevenWeights <- rbind(kyphosis, kyphosis)
kyphosisUnevenWeights$myWeights <- c(rep(1,length.out=nrow(kyphosis)), rep(2,length.out=nrow(kyphosis)))

fitUneven30 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
                     control=rpart.control(minsplit=30))
plot(fitUneven30)
text(fitUneven30, use.n=TRUE)
################################################################################################################

# this dataset is the original data repeated 3 times
# without weights
kyphosisRepeated <- rbind(kyphosis, kyphosis, kyphosis)
fitRepeated <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisRepeated, control=rpart.control(minsplit=45))
plot(fitRepeated)
text(fitRepeated, use.n=TRUE)

# with weights
kyphosisUnevenWeights <- rbind(kyphosis, kyphosis, kyphosis)
kyphosisUnevenWeights$myWeights <- c(rep(1,length.out=nrow(kyphosis)), rep(2,length.out=nrow(kyphosis)), rep(3,length.out=nrow(kyphosis)))

fitUneven45 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
                     control=rpart.control(minsplit=45))
plot(fitUneven45)
text(fitUneven45, use.n=TRUE)

See this blog post for more details on RPart.

Ben
  • I'm trying to work with unbalanced weights and minsplit on a general dataset, and the example shows it doesn't work. Balancing the weights is not a general solution; it may result in a dataset that's too big. – Saar Sep 26 '14 at 03:18
  • @Saar, I apologize if I'm missing something obvious. You say that the examples show "it doesn't work". In what way doesn't it work? When I tested the examples, a tree grew in each of them without any errors. Did one of the trees grow in a way you didn't expect? – Ben Sep 26 '14 at 05:22
  • In all 6 examples the data is the same, represented in different ways (except in the first example): each observation is either repeated three times, appears once with a weight of 3, or appears twice with weights that add up to 3. I would expect the trees built from it to be the same (same data, same algorithm, same conditions should lead to the same output). Specifically, the fifth example should give me the same tree as the second and third examples. It doesn't. This isn't about run-time errors, it's about getting the wrong answers back... – Saar Sep 26 '14 at 11:18
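
Whether two of these fits are "the same tree" can be checked programmatically rather than by comparing plots. The following is a minimal sketch, not a definitive test: `same_tree` is a hypothetical helper that compares only the sequence of split variables and the weighted node sizes stored in each fit's `frame`, so it ignores split thresholds and would need extending for a stricter comparison.

```r
library(rpart)

# Hypothetical helper: coarse structural comparison of two rpart fits.
# Compares the split variable at each node and the weighted node sizes.
same_tree <- function(a, b) {
  identical(as.character(a$frame$var), as.character(b$frame$var)) &&
    isTRUE(all.equal(a$frame$wt, b$frame$wt))
}

data(kyphosis)
form <- Kyphosis ~ Age + Number + Start

# Original data repeated three times, minsplit scaled to match.
rep3 <- rbind(kyphosis, kyphosis, kyphosis)
fitRepeated <- rpart(form, data = rep3,
                     control = rpart.control(minsplit = 45))

# Original data once, with every row given a weight of 3.
w3 <- transform(kyphosis, myWeights = 3)
fitWeighted <- rpart(form, data = w3, weights = myWeights,
                     control = rpart.control(minsplit = 15))

# TRUE if both fits split on the same variables with the same weighted counts.
same_tree(fitRepeated, fitWeighted)
```

The same helper can be applied to any pair of fits from the question, for example the uneven-weight fit with `minsplit=45` against the repeated-data fit.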