
I've been learning R over the past couple of years and am attempting to create a regression tree for an analysis at work. I have a dataset of roughly 750K records, a target variable that's numeric, and a corresponding vector of weights. There are no nulls in the target or the weights. When I run rpart(), I get a root node only. So I fiddled with the control parameters and developed a tree with 14 nodes and a depth of 10. I'd really like to get a tree that's a bit simpler and smaller than this. If I raise the cp parameter (I went down to 0.0001 from the default of 0.01 to get a non-root-only result), or raise the minbucket or minsplit parameters, that 14-node tree collapses down to the root only. Shouldn't I be able to get something in between 14 nodes and only one?
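For concreteness, here is a minimal sketch of the kind of call described above (dat, y, and w are placeholder names, and the control values are illustrative, not the actual ones used):

library(rpart)

# Placeholder data frame "dat" with numeric target "y" and a weight
# column "w"; the control values below are illustrative only.
fit <- rpart(y ~ ., data = dat, weights = w,
             control = rpart.control(cp        = 0.0001, # loosened from the default 0.01
                                     minsplit  = 20,     # min observations to attempt a split
                                     minbucket = 7))     # min observations in a terminal node
printcp(fit)  # show the complexity-parameter (cp) table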

I tried the tree package and, while I'm not quite as familiar with the parameter manipulation there, I seem to be running into the same issue, ending up with a "singlenode" tree.
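The rough equivalent with the tree package looks like this (again a sketch with placeholder names; mindev plays a role similar to rpart's cp):

library(tree)

# tree.control needs the number of observations up front; a node is only
# split if its deviance is at least mindev * the root deviance.
fit_tree <- tree(y ~ ., data = dat,
                 control = tree.control(nobs = nrow(dat),
                                        mincut = 5, minsize = 10,
                                        mindev = 0.0001))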

Everything I come across online suggests, at least to me, that I should be able to snip away at the tree until I get something that's the size I want. While the 14-node result isn't the "bushiest" tree in the world, it doesn't seem so "stringy" that it should collapse entirely under such small attempts to tighten it up. I've included a picture of the result to illustrate the structure.
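For reference, the standard way to snip an rpart tree back is prune(); a sketch, where fit is the 14-node model and the cp value is illustrative:

# Prune back to the subtree corresponding to a larger cp. This only helps
# if fit$cptable actually contains an intermediate subtree between the
# full tree and the root.
fit_small <- prune(fit, cp = 0.001)
plot(fit_small); text(fit_small)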

[Image: plot of my 14-node tree]

  • Still working on this. I've resorted to manual pruning: converting truncated versions of the node paths to coded rules and binning my data from there. Tedious, but it's working. Where this approach hurts is when I take a different sample of my data to train the tree model; then I have to recode a different set of node paths. Any help would be appreciated! – Jason Culp Jul 18 '16 at 17:19
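(For anyone in a similar position: rpart::path.rpart prints the split rules leading to given nodes, which may cut down on the manual recoding described above. A sketch, with fit as a placeholder for the fitted tree:)

# Node numbers of the terminal nodes are the rownames of fit$frame.
leaves <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])
path.rpart(fit, nodes = leaves)  # print the rule path to every leaf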

1 Answer


This should really be a comment, but I don't have enough rep to post comments.

Firstly, it would be great if you could post a reproducible example, though I guess that might be difficult given that the behaviour you are seeing depends on your data.

Secondly, did you try looking at the cptable component of your tree? For example, if your tree is called rpart_1, you can inspect it by typing rpart_1$cptable. Here is an example of a cptable:

          CP nsplit rel error    xerror       xstd
1 0.17045881      0 1.0000000 1.0001723 0.01374676
2 0.05035021      1 0.8295412 0.8299854 0.01125953
3 0.01888694      2 0.7791910 0.7865432 0.01199921
4 0.01177287      4 0.7414171 0.7446313 0.01251485
5 0.01000000      5 0.7296442 0.7362431 0.01248352

Without going into detail, the CP column tells you which value of cp will prune the tree back to a given number of splits (the nsplit column). If you look at the cptable for your tree, do you see something like this, or does it only have two lines?
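For completeness, here is a sketch of how the table is normally used, assuming the fitted tree is called rpart_1:

# Pick the row with the lowest cross-validated error (xerror) and prune
# the tree back to that row's CP value.
best_row <- which.min(rpart_1$cptable[, "xerror"])
rpart_2  <- prune(rpart_1, cp = rpart_1$cptable[best_row, "CP"])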

Alan Chalk
  • I did look at this table actually and it does only have two lines, jumping straight from the 14-node version to the root. – Jason Culp Jul 20 '16 at 17:59
  • OK. Can you post the $splits table of your rpart object? That may give some hint of what is going on... – Alan Chalk Jul 21 '16 at 18:16
  • I'll plan to do this, but I'm still working on it and came across something interesting. Previously I had been binning all of my predictor variables in some way; most had a cardinality of two or three, with just a few having more. I tried loosening up the model by removing a good deal of this binning. While it now takes longer to fit, which makes sense, I can now prune the results much more flexibly, as I originally expected to be able to. – Jason Culp Jul 22 '16 at 17:03
  • OK. Would be interesting to see the $splits on the original model when you are done, so we can get to the bottom of why it would not prune nicely. btw you need to be careful if you bin into factors with lots of levels, though that is another story. – Alan Chalk Jul 23 '16 at 21:40
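(For readers following along: the $splits table mentioned in these comments can be inspected directly; a sketch, with fit as a placeholder name:)

# One row per primary, competing, or surrogate split; the "improve"
# column is the gain attributed to each split. Tiny improvements beyond
# the first split would explain a tree that prunes straight to the root.
fit$splits[, "improve"]
summary(fit)  # also lists competing and surrogate splits at each node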