I analyzed a dataset of 266 instances and about 100 indicators using a J48 tree in R. I'm not very experienced in machine learning, but I managed to fit the J48 tree in both Weka and R. In R, I found that the tree can be visualized through the partykit package. However, I find it difficult to interpret the results, which are shown below (X, Y and Z are 3 of the 100+ indicators I use to describe each of the 266 instances, of which 190 are normal, or 0, and 76 are abnormal, or 1).

[Figure: J48 pruned tree]

The code I used is very simple:

library(RWeka)
# R = TRUE asks Weka's J48 for reduced-error pruning
m1 <- J48(Case ~ ., data = mydata, control = Weka_control(R = TRUE))
if (require("partykit", quietly = TRUE)) plot(m1)

Thus I've pruned the tree. One more question: I understand that I can obtain the fitted values from the tree, but I don't know how. Any help with either question, or both, would be appreciated.

Ciochi

2 Answers

The variables X, Y, Z have been selected to split (or partition) your data while the remaining variables have not been selected. The resulting terminal nodes thus lead to different probabilities for the response. The response probabilities are also displayed by the stacked bar plots in the terminal nodes of the visualization.

For example, if X <= 34, then the response probability is rather low (around 17%). This is the largest subset, with 193 of the 266 observations. The only subset for which the response probability is very high (around 96%) consists of the 35 observations with X > 34 & Y <= 482 & Z > 451.

As already pointed out by @Roman Luštrik: The fitted values for each observation can be obtained by predict(m1, type = "prob").
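As a minimal sketch of what this looks like in practice, the snippet below fits a J48 tree and extracts both the fitted classes and the fitted probabilities, then cross-tabulates observed against predicted classes. The built-in iris data is used purely as a stand-in, since mydata is not available here; the object names m, fitted_class, fitted_prob and conf are illustrative.

```r
library(RWeka)

# iris stands in for the asker's data (an assumption for illustration only)
m <- J48(Species ~ ., data = iris)

fitted_class <- predict(m, type = "class")        # factor of predicted labels
fitted_prob  <- predict(m, type = "probability")  # matrix, one column per class

# In-sample confusion table: rows are observed, columns are predicted
conf <- table(observed = iris$Species, predicted = fitted_class)
conf

# Share of correctly classified instances in the training data
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

The share computed in the last step should correspond to the "correctly classified instances" figure that summary() reports for the same model.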

Achim Zeileis
  • Okay, but what does this mean? If I run summary(m1), I get that 83% of instances are correctly identified. – Ciochi Nov 01 '15 at 23:00
  • Yes, nodes 2, 5, 7 would be classified as 0 responses and 6 as a 1 response (if you do simple majority voting). And as the barplots show, you will have some misclassifications in each node, especially node 5. – Achim Zeileis Nov 01 '15 at 23:12
  • I may have understood now, just check if I'm right. I get 83% of instances correctly identified, which means that about 46 of the total 266 are misidentified. If I check each node, I get: - node 2 should be 0 responses, but nearly 18% of 193 are misidentified, which is roughly 35; - node 5 should be 0s, but nearly 42% of 21 are misidentified, which is roughly 9; - no misidentification of 1 responses in node 6; - node 7 is again 0s, with about 22% of 17 misidentified, which is roughly 3. Summing these gives 47, which, considering the rounding I did, should be fine, shouldn't it? – Ciochi Nov 01 '15 at 23:33
  • Yes (up to rounding etc.). Using further splits (e.g. switching off pruning) you might improve the in-sample misclassification rate - but this might perform worse out of sample on new data. – Achim Zeileis Nov 01 '15 at 23:43
  • Okay, now how can I use this model I've created to classify new data? Thanks a lot for your help. – Ciochi Nov 02 '15 at 00:44
  • You're welcome. Please also accept the answer if the problem is resolved. – Achim Zeileis Nov 02 '15 at 07:07
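
The node-by-node tally from the comments can be reproduced as a quick check. This is only a sketch in base R: the node sizes and error rates are the rounded values quoted in the thread (read off the plot), not exact counts, and the data frame nodes is a hypothetical helper.

```r
# Node sizes and approximate error rates as quoted in the comments
# (rounded values read off the plot, not exact counts)
nodes <- data.frame(
  node     = c(2, 5, 6, 7),
  n        = c(193, 21, 35, 17),
  err_rate = c(0.18, 0.42, 0.00, 0.22)
)

nodes$errors <- nodes$n * nodes$err_rate    # approximate misclassified per node

total_n      <- sum(nodes$n)                # all instances across terminal nodes
total_errors <- sum(nodes$errors)           # roughly 47
accuracy     <- 1 - total_errors / total_n  # roughly 0.82

round(c(total_n = total_n, total_errors = total_errors, accuracy = accuracy), 2)
```

With these rounded inputs the accuracy comes out at about 82%, which agrees with the 83% from summary(m1) up to rounding.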

The general way of obtaining fitted values in R is the predict function. In your case, you are interested in the classification probabilities. See ?predict.Weka_classifier for more info.

library(RWeka)

m1 <- J48(Species ~ ., data = iris)
head(predict(m1, type = "probability"))  # first six rows only
    setosa versicolor  virginica
1        1 0.00000000 0.00000000
2        1 0.00000000 0.00000000
3        1 0.00000000 0.00000000
4        1 0.00000000 0.00000000
5        1 0.00000000 0.00000000
6        1 0.00000000 0.00000000
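
The same function can also be used to classify new data, by passing a data frame through the newdata argument. Below is a minimal sketch that holds out part of iris as the "new" data; the 100/50 split, the seed, and the names train, test, pred_class and pred_prob are assumptions made for illustration.

```r
library(RWeka)

set.seed(1)                       # reproducible split
idx   <- sample(nrow(iris), 100)  # hypothetical 100/50 train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

m <- J48(Species ~ ., data = train)

# newdata must contain the same predictor columns as the training data
pred_class <- predict(m, newdata = test, type = "class")
pred_prob  <- predict(m, newdata = test, type = "probability")

# Out-of-sample accuracy on the held-out rows
oos_accuracy <- mean(pred_class == test$Species)
oos_accuracy
```

Accuracy on held-out rows is usually a more honest estimate than the in-sample figure, which is the concern raised in the comments about unpruned trees doing worse on new data.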
Roman Luštrik