I have a dataset with profit as the target variable and about 100 different predictor variables (some binary, some continuous and some character).

Is there a decision tree package in R that will give buckets (or end nodes) where profit is maximised (and preferably > 0)?

Currently I am using ctree from the partykit package. The trees invariably give good splits on the predictor variables, but the end nodes invariably have negative profit.

I am also having difficulty understanding the results shown at the end nodes. These tend to be 'N=' and 'Error='. Is there some way to get 'profit=' instead, so you can see which end node is best?

Many thanks,

Tammboy

1 Answer

Simple things first: Understanding the printed results in each terminal node. Consider the following simple (and not particularly useful) ctree that models how the stopping distance of cars depends on their speed:

library("partykit")
ctree(dist ~ speed, data = cars)
## 
## Model formula:
## dist ~ speed
## 
## Fitted party:
## [1] root
## |   [2] speed <= 17
## |   |   [3] speed <= 12: 18.200 (n = 15, err = 1176.4)
## |   |   [4] speed > 12: 39.750 (n = 16, err = 3535.0)
## |   [5] speed > 17: 65.263 (n = 19, err = 9015.7)
## 
## Number of inner nodes:    2
## Number of terminal nodes: 3

This means, for example, that node 5 contains the 19 observations with speed > 17, whose average stopping distance is 65.263, corresponding to an error sum of squares of 9015.7.
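As a quick check of that reading, the three numbers for node 5 can be reproduced with base R alone. This is only a minimal sketch: node5 is an illustrative variable name, not something partykit creates.

```r
# Observations that end up in node 5 (speed > 17).
node5 <- subset(cars, speed > 17)

# Number of observations in the node (the "n" in the printout).
n5 <- nrow(node5)

# Mean of the response in the node (the first number in the printout).
mean5 <- mean(node5$dist)

# Error sum of squares around that mean (the "err" in the printout).
err5 <- sum((node5$dist - mean5)^2)

c(n = n5, mean = mean5, err = err5)
```

The values should agree with the 19, 65.263 and 9015.7 shown for node 5 in the printed tree.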

Thus, the mean of the target variable is given first (before the n and err) and is what you will be most interested in. To maximize the target variable you can then choose the terminal node with the highest predicted mean.
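One possible way to put the terminal nodes side by side is sketched below. It relies on predict() with type = "node" returning the terminal node id for each observation; node_summary and its column names are made up for this illustration. With your own data you would presumably use profit in place of dist and your data frame in place of cars.

```r
library("partykit")

# Fit the same small tree as in the printout.
ct <- ctree(dist ~ speed, data = cars)

# Terminal node id for every observation in the training data.
node <- predict(ct, newdata = cars, type = "node")

# Per-node summary: size, mean and total of the response.
node_summary <- data.frame(
  node  = sort(unique(node)),
  n     = as.vector(table(node)),
  mean  = as.vector(tapply(cars$dist, node, mean)),
  total = as.vector(tapply(cars$dist, node, sum))
)

# Terminal nodes ordered from highest to lowest mean response.
node_summary[order(-node_summary$mean), ]
```

With profit as the response, filtering this table with node_summary$mean > 0 would show whether any terminal node has a positive average profit.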

Finally, I don't know of a tree method that is dedicated to profit maximization directly. The standard tree methods try to find terminal nodes that are homogeneous in a certain way (here, with an approximately constant average target value).

Achim Zeileis
  • Ideally, I want a scenario where, when a particular variable with, say, 5 attributes is split, the decision tree sums up the profit in each permutation and splits the variable into two buckets where the difference in the summed profit is greatest. Is there any function that does this? – Tammboy Sep 15 '16 at 16:56
  • And then loops through each variable to find the variable with the greatest profit differentiation. This would then become node1... – Tammboy Sep 15 '16 at 16:57
  • This would need to be formulated more precisely for a reliable answer. It may be that standard regression trees are very close to what you want to do - but it might also be that it is very different. At the moment I'm not sure. – Achim Zeileis Sep 15 '16 at 19:57
  • Here is some example data with profit being the dependent variable: – Tammboy Sep 15 '16 at 20:42