
How does the C4.5 algorithm deal with missing values and with attribute values on a continuous interval? Also, how is a decision tree pruned? Could someone please explain with the help of an example?

Sergei Bubenshchikov
  • https://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algorithms-deal-with-missing-values-under-the-hoo – James LT Oct 18 '18 at 20:48

1 Answer


Say we built a decision tree from the canonical example of whether one should play golf based on the weather conditions. We may have a training dataset like this:

OUTLOOK | TEMPERATURE | HUMIDITY | WINDY | PLAY
=====================================================
sunny   |      85     |    85    | false | Don't Play
sunny   |      80     |    90    | true  | Don't Play
overcast|      83     |    78    | false | Play
rain    |      70     |    96    | false | Play
rain    |      68     |    80    | false | Play
rain    |      65     |    70    | true  | Don't Play
overcast|      64     |    65    | true  | Play
sunny   |      72     |    95    | false | Don't Play
sunny   |      69     |    70    | false | Play
rain    |      75     |    80    | false | Play
sunny   |      75     |    70    | true  | Play
overcast|      72     |    90    | true  | Play
overcast|      81     |    75    | false | Play
rain    |      71     |    80    | true  | Don't Play

And use it to build a decision tree that may look something like this:

              Outlook
            /  |      \
  overcast /   |sunny  \rain
          /    |        \
     Play   Humidity   Windy
           /   |         |  \
          /    |         |   \
    <=75 /  >75|     true|    \false
        /      |         |     \
     Play  Don'tPlay Don'tPlay  Play
  1. The C4.5 Algorithm deals with missing values by returning the probability distribution of the labels under the attribute branch for which the value is missing. Suppose that we had an instance in our test data that showed the outlook to be Sunny but did not have a value for the attribute Humidity. Also, suppose that our training data had 2 instances for which the outlook was Sunny, Humidity was below 75, and a label of Play. Furthermore, suppose the training data had 3 instances where the outlook was Sunny, Humidity was above 75, and had a label of Don't Play. So for the test instance with the missing Humidity attribute, the C4.5 algorithm would return a probability distribution of [0.4, 0.6] corresponding to [Play, Don't Play].
  2. Assuming that you already understand how decision trees use information gain over a set of features to choose which feature to branch on at each level, the C4.5 algorithm performs the same procedure on a continuous-valued attribute by evaluating the information gain of every candidate split of that attribute and choosing the best one. An example of this can be seen in the Humidity attribute above. The C4.5 algorithm tested the information gain provided by the Humidity attribute by splitting it at 65, 70, 75, 78...90 and found that splitting at 75 provided the most information gain.
  3. C4.5 performs pruning by replacing a subtree in the decision tree with a single decision node that either encompasses all the decisions of the subtree or provides the least error.
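The class-probability idea in point 1 can be sketched in a few lines of Python. This is a toy illustration rather than Quinlan's actual implementation, and the humidity values below are made up to match the 2-vs-3 Sunny example in the answer:

```python
from collections import Counter

def class_distribution(rows, label_idx=-1):
    """Return the label probability distribution among the training rows
    that reached a given node."""
    counts = Counter(row[label_idx] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical training rows that reached the Sunny -> Humidity node:
# 2 instances with Humidity <= 75 labelled Play, 3 above 75 labelled Don't Play.
sunny_rows = [
    (70, "Play"), (72, "Play"),
    (85, "Don't Play"), (90, "Don't Play"), (95, "Don't Play"),
]
print(class_distribution(sunny_rows))  # {'Play': 0.4, "Don't Play": 0.6}
```

A test instance with Outlook = Sunny and Humidity missing would then receive this [0.4, 0.6] distribution over [Play, Don't Play] instead of a hard prediction.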
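The threshold search in point 2 can be sketched the same way. This is a simplified version (real C4.5 additionally uses a gain-ratio correction and reports the threshold as a value occurring in the data). Run on just the five sunny rows above, it happens to pick 70 rather than 75, which induces the same partition as the tree's <= 75 test because no humidity value in that subset falls between 70 and 85:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    return -sum((n / len(labels)) * math.log2(n / len(labels))
                for n in counts.values())

def best_threshold(values, labels):
    """Try a 'v <= t' split at each candidate value and return the
    (threshold, information_gain) pair with the highest gain."""
    base = entropy(labels)
    best = (None, -1.0)
    # N distinct values yield only N-1 usable thresholds: splitting at the
    # largest value would put every case on one side of the test.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = (base
                - len(left) / len(labels) * entropy(left)
                - len(right) / len(labels) * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

# Humidity values and labels for the five "sunny" training rows above.
humidity = [85, 90, 95, 70, 70]
play = ["Don't Play", "Don't Play", "Don't Play", "Play", "Play"]
print(best_threshold(humidity, play))  # (70, 0.9709...) - a perfect split here
```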
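Point 3 is easiest to see in code. The sketch below does reduced-error-style subtree replacement against a hold-out set; C4.5 proper instead uses a pessimistic error estimate computed from the training data, so treat the error criterion here as a stand-in. The tree representation is ad hoc for the example:

```python
from collections import Counter

# Minimal tree representation: a leaf is a label string; an internal node is
# (attribute_index, threshold, left_subtree, right_subtree), testing v <= threshold.

def predict(node, row):
    if isinstance(node, str):
        return node
    attr, thr, left, right = node
    return predict(left if row[attr] <= thr else right, row)

def errors(node, rows):
    """Count misclassified rows; each row's last element is its label."""
    return sum(predict(node, r[:-1]) != r[-1] for r in rows)

def prune(node, rows):
    """Bottom-up pruning: replace a subtree with its majority-class leaf
    whenever the leaf makes no more errors on `rows` than the subtree."""
    if isinstance(node, str) or not rows:
        return node
    attr, thr, left, right = node
    left = prune(left, [r for r in rows if r[attr] <= thr])
    right = prune(right, [r for r in rows if r[attr] > thr])
    node = (attr, thr, left, right)
    majority = Counter(r[-1] for r in rows).most_common(1)[0][0]
    return majority if errors(majority, rows) <= errors(node, rows) else node

# A Humidity <= 75 subtree. If the hold-out rows say the split never helps,
# the whole subtree collapses into a single "Play" leaf.
tree = (0, 75, "Play", "Don't Play")
print(prune(tree, [(60, "Play"), (80, "Play")]))  # Play
```

When the split does pay for itself on the hold-out rows (e.g. one Play below 75 and one Don't Play above), the same call leaves the subtree intact.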

For more information, I would suggest this excellent resource I used to write my own Decision Tree and Random Forest algorithm: https://cis.temple.edu/~giorgio/cis587/readings/id3-c45.html

Chirag
  • Do you have any other questions? – Chirag Mar 02 '17 at 01:27
  • A question about "by splitting it at 65, 70, 75, 78...90": what about the 95 and 96? – Calvin Oct 24 '18 at 16:09
  • I meant to say that all of the values were evaluated and the value 75 was chosen. – Chirag Oct 24 '18 at 16:28
  • I am learning C4.5 and am confused about this. In Quinlan's paper "Improved Use of Continuous Attributes in C4.5", page 79, it says that if there are N distinct values of A in the set of cases D, there will be N-1 thresholds that could be used for a test on A. Is that the same as choosing a threshold like in your answer, e.g. [65] 70 where 65 is the threshold? If so, wouldn't that give N thresholds, which differs from the paper's N-1? – Calvin Oct 25 '18 at 15:28
  • Oh, it is N-1 because the largest value won't be evaluated as a threshold: with v <= threshold going to D1 and v > threshold going to D2, D1 would contain all the cases and D2 would be empty. – Calvin Oct 26 '18 at 15:47
  • Sorry, just saw this, but you got it :) – Chirag Oct 28 '18 at 22:35