
I am using the rpart package like so:

model <- rpart(totalUSD ~ ., data = df.train)

I notice that, over 80k rows, rpart is generalizing its predictions into just three distinct groups, as shown in the image below:

[image: plot of predicted totalUSD values, collapsing into only three distinct levels]

I see several configuration options for the rpart method; however, I don't quite understand them.

Is there a way to configure rpart so that it creates more predictions (instead of just three)? Not such stark groups, but more levels in between?

The reason I ask is that my cost estimator looks rather simplistic, since it only ever returns one of three numbers!

Here is an example of my data:

structure(list(totalUSD = c(9726.6, 730.14, 750, 200, 60.49, 
310.81, 151.23, 145.5, 3588.13, 400), durationDays = c(730, 724, 
730, 189, 364, 364, 364, 176, 730, 1095), familySize = c(4, 1, 
2, 1, 3, 2, 1, 1, 4, 4), serviceName = c("Service5", 
"Service6", "Service9", "Service4", 
"Service1", "Service2", "Service1", "Service3", 
"Service7", "Service8"), homeLocationGeoLat = c(37.09024, 
10.691803, 37.09024, 35.86166, 55.378051, 35.86166, 51.165691, 
-30.559482, -30.559482, 41.87194), homeLocationGeoLng = c(-95.712891, 
-61.222503, -95.712891, 104.195397, -3.435973, 104.195397, 10.451526, 
22.937506, 22.937506, 12.56738), hostLocationGeoLat = c(55.378051, 
37.09024, 55.378051, 55.378051, 37.09024, 1.352083, 55.378051, 
37.09024, 23.424076, 1.352083), hostLocationGeoLng = c(-3.435973, 
-95.712891, -3.435973, -3.435973, -95.712891, 103.819836, -3.435973, 
-95.712891, 53.847818, 103.819836), geoDistance = c(6838055.10555534, 
4532586.82063172, 6838055.10555534, 7788275.0443749, 6838055.10555534, 
3841784.48282769, 1034141.95021832, 14414898.8246973, 6856033.00945242, 
10022083.1525388)), .Names = c("totalUSD", "durationDays", "familySize", 
"serviceName", "homeLocationGeoLat", "homeLocationGeoLng", "hostLocationGeoLat", 
"hostLocationGeoLng", "geoDistance"), row.names = c(25601L, 6083L, 
24220L, 20235L, 8372L, 456L, 8733L, 27257L, 15928L, 24099L), class = "data.frame")
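A quick way to confirm how many distinct values a fitted tree actually predicts (using the model call from the question; `df.train` is assumed to be the training frame above):

```r
library(rpart)

# Fit the tree as in the question.
model <- rpart(totalUSD ~ ., data = df.train)

# Each terminal node (leaf) predicts a single constant (the mean of its
# observations), so the number of unique predictions equals the number
# of leaves in the tree.
preds <- predict(model, df.train)
length(unique(preds))
nrow(model$frame[model$frame$var == "<leaf>", ])  # same count, read from the tree itself
```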
user1477388
  • can you give us a sample of data or some reproducible example? – roman Jul 29 '15 at 09:55
  • Yes, I have added it to my question. Thank you. – user1477388 Jul 29 '15 at 13:04
  • 1
    Ok I had a little play with your data. It's hard to re-create your problem as the tree you are building uses much more data. I am guessing that the parameters are set such that you have two splits in your tree (two explanatory variables of importance), resulting in 3 terminal nodes. The tree predicts the mean in the regions at each terminal node. If you want more fine-scale predictions, try something like random forests or boosting rather than fitting a single tree. Then you can use cross validation to tune the number of averaged trees (random forests) or the shrinkage parameter (boosting). – roman Jul 30 '15 at 14:24
  • 1
    i highly recommend chapter 8 of this lecture series; http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/ that should see you well on your way to bossing decision trees. – roman Jul 30 '15 at 14:25
  • 2
    you can tune some of the parameters to get more splits, particularly these `rpart(..., control = list(minsplit = 20, minbucket = 20/3, cp = .01))` – rawr Jul 30 '15 at 15:41
  • @rawr Can you post an answer describing what each parameter would do and what I would expect? For example, would minsplit = 20 provide 20 different predictions? – user1477388 Jul 30 '15 at 17:05
  • For the discussion, I will also include the documentation. However, some examples would be nice, too! https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/rpart.control.html – user1477388 Aug 02 '15 at 13:40
  • @rawr I tried changing to `model <- rpart(totalUSD ~ ., data = df.train, control = list(minsplit = 20, minbucket = 20/3, cp = .01))` but I still only get three levels of predictions exactly like the image above describes. – user1477388 Aug 05 '15 at 19:37
  • @user1477388 well those are the default values so I would expect that, see `?rpart.control` – rawr Aug 05 '15 at 19:47
  • @rawr I also tried doubling it but still no difference was observed. I am going to try random forest. – user1477388 Aug 06 '15 at 13:18
  • You have to *lower* them. It currently says "nodes must have at least 20 observations to consider splitting, and if you do split, each new leaf must have at least 20/3 observations." You should really read `?rpart.control`. The cp is more specific. Since you have 80k observations and only three splits, that is pretty compelling. – rawr Aug 06 '15 at 13:27
  • @rawr I tried reading it but it's difficult for me to understand. I will try again. Since it was hard to understand, I asked for a more comprehensive example if anyone had the time to provide one. – user1477388 Aug 06 '15 at 15:21
  • rpart has [two vignettes](https://github.com/cran/rpart/tree/master/inst/doc) which might help – rawr Aug 06 '15 at 16:34
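Pulling the comment thread together: the values rawr quoted are the defaults, so to get more terminal nodes they need to be *lowered*, not restated. A sketch of that (the specific values here are illustrative, not tuned for this data):

```r
library(rpart)

# Defaults are minsplit = 20, minbucket = round(minsplit/3), cp = 0.01.
# Lowering them allows more splits; cp is usually the binding constraint,
# since a split is only kept if it improves the fit by a factor of cp.
model <- rpart(totalUSD ~ ., data = df.train,
               control = rpart.control(minsplit = 5,
                                       minbucket = 2,
                                       cp = 1e-4))

# Inspect the complexity table and resulting tree size.
printcp(model)
length(unique(predict(model, df.train)))  # number of distinct prediction levels
```

`df.train` is assumed to be the 80k-row training frame from the question.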

1 Answer


If you really want a complex tree structure, try this:

library(rpart)
fit = rpart(totalUSD ~ ., data = df.train, control = rpart.control(cp = 0))

With cp = 0, basically every split is tried, regardless of whether it improves the fit. You get a really complex model this way, but since you have 80k observations, set minsplit or minbucket to a number you are comfortable with.

I used this strategy in a random forest implementation I'm working on. Beware: computation time might increase a lot.
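A common companion to growing with cp = 0 is to prune the over-complex tree back afterwards using the cross-validation results rpart stores in `fit$cptable`. A sketch (again assuming `df.train` from the question; the minbucket value is just a guard against tiny leaves):

```r
library(rpart)

# Grow a deliberately over-complex tree.
fit <- rpart(totalUSD ~ ., data = df.train,
             control = rpart.control(cp = 0, minbucket = 50))

# fit$cptable holds a cross-validated relative error ("xerror") for each
# candidate subtree; prune back to the cp with the lowest xerror.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

This keeps the extra resolution cp = 0 buys you, but only where the splits actually generalize.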

catastrophic-failure