How does the C4.5 algorithm deal with missing values and with attributes whose values lie on a continuous interval? Also, how is a decision tree pruned? Could someone please explain with the help of an example?
- https://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algorithms-deal-with-missing-values-under-the-hoo – James LT Oct 18 '18 at 20:48
1 Answer
Say we build a decision tree for the canonical example of deciding whether to play golf based on the weather conditions. We might have a training dataset like this:
OUTLOOK  | TEMPERATURE | HUMIDITY | WINDY | PLAY
====================================================
sunny    | 85          | 85       | false | Don't Play
sunny    | 80          | 90       | true  | Don't Play
overcast | 83          | 78       | false | Play
rain     | 70          | 96       | false | Play
rain     | 68          | 80       | false | Play
rain     | 65          | 70       | true  | Don't Play
overcast | 64          | 65       | true  | Play
sunny    | 72          | 95       | false | Don't Play
sunny    | 69          | 70       | false | Play
rain     | 75          | 80       | false | Play
sunny    | 75          | 70       | true  | Play
overcast | 72          | 90       | true  | Play
overcast | 81          | 75       | false | Play
rain     | 71          | 80       | true  | Don't Play
And use it to build a decision tree that may look something like this:
                 Outlook
               /    |    \
     overcast /     |sunny \ rain
             /      |       \
          Play   Humidity   Windy
                  /  \        /  \
            <=75 /    \ >75  /true \ false
               /        \   /        \
            Play    Don't  Don't     Play
                    Play   Play
- The C4.5 algorithm deals with missing values by returning the probability distribution of the labels under the branch where the attribute value is missing. Suppose that we had an instance in our test data whose outlook was Sunny but which had no value for the attribute Humidity. Also, suppose that our training data had 2 instances for which the outlook was Sunny, Humidity was below 75, and the label was Play, and 3 instances for which the outlook was Sunny, Humidity was above 75, and the label was Don't Play. For the test instance with the missing Humidity attribute, the C4.5 algorithm would then return the probability distribution [0.4, 0.6] corresponding to [Play, Don't Play]. (A small code sketch of this appears after the list.)
- Assuming that you already understand how decision trees use information gain over a set of features to choose which feature to branch on at each level, the C4.5 algorithm performs this same procedure on a continuous attribute by evaluating the information gain of every candidate split of that attribute and choosing the best one. An example of this can be seen in the Humidity attribute above: the algorithm tested the information gain obtained by splitting Humidity at 65, 70, 75, 78...90 and found that performing the split at 75 provided the most information gain. (See the same sketch below.)
- C4.5 performs pruning by replacing a subtree of the decision tree with a single leaf node, predicting that subtree's most common class, whenever the estimated error of the single leaf is no worse than the estimated error of the subtree. (A sketch of this comparison also follows the list.)
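To make the first two bullets concrete, here is a minimal Python sketch, not Quinlan's actual C4.5 implementation (which also uses the gain ratio and snaps the chosen threshold to a value occurring in the data). It is restricted to the five sunny instances from the table, since that is the branch where the Humidity test appears, and the helper names entropy and info_gain are invented for this illustration.

from collections import Counter
from math import log2

# Humidity value and label of the five "sunny" training instances above.
sunny = [(85, "Don't Play"), (90, "Don't Play"), (95, "Don't Play"),
         (70, "Play"), (70, "Play")]

def entropy(labels):
    # Shannon entropy of a list of class labels.
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(instances, threshold):
    # Gain of splitting into (value <= threshold) and (value > threshold).
    labels = [lbl for _, lbl in instances]
    left = [lbl for v, lbl in instances if v <= threshold]
    right = [lbl for v, lbl in instances if v > threshold]
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - remainder

# Continuous attribute: every distinct value except the largest is a candidate
# threshold (the N-1 thresholds discussed in the comments below); keep the one
# with the highest information gain.
candidates = sorted({v for v, _ in sunny})[:-1]            # [70, 85, 90]
best = max(candidates, key=lambda t: info_gain(sunny, t))
print(best)  # 70 -- any cut point between 70 and 85 (e.g. the 75 shown in the
             # tree) partitions these five instances identically.

# Missing value: a test instance reaching this node without a Humidity value is
# answered with the class distribution of the training instances at the node.
labels = [lbl for _, lbl in sunny]
dist = {lbl: c / len(labels) for lbl, c in Counter(labels).items()}
print(dist)  # Don't Play: 0.6, Play: 0.4 -- the [0.4, 0.6] mentioned above

And here is a sketch of the comparison made when pruning by subtree replacement. C4.5 itself uses a pessimistic, confidence-interval-based error estimate; plain training-error counts are used here only to show the shape of the decision:

def should_replace_subtree(leaf_label_counts):
    # leaf_label_counts: one Counter per leaf of the subtree, holding the
    # training labels that reached that leaf.
    keep_errors = sum(sum(c.values()) - max(c.values()) for c in leaf_label_counts)
    merged = sum(leaf_label_counts, Counter())
    collapsed_errors = sum(merged.values()) - max(merged.values())
    # Collapse the subtree into one leaf (its majority class) if that is no worse.
    return collapsed_errors <= keep_errors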
For more information, I would suggest this excellent resource, which I used when writing my own decision tree and random forest implementations: https://cis.temple.edu/~giorgio/cis587/readings/id3-c45.html

Chirag
- A question about "by splitting it at 65, 70, 75, 78...90": what about the 95 and 96? – Calvin Oct 24 '18 at 16:09
- I meant to say that all of the values were evaluated and the value 75 was chosen. – Chirag Oct 24 '18 at 16:28
- I am learning C4.5 and am confused by this. In Quinlan's paper "Improved Use of Continuous Attributes in C4.5", on page 79, it says that if there are N distinct values of A in the set of cases D, there will be N-1 thresholds that could be used for a test on A. Is that the same as choosing the threshold the way your answer does, e.g. [65] 70 where 65 is the threshold? If so, then there would be N thresholds, which differs from the paper's N-1. – Calvin Oct 25 '18 at 15:28
- Oh, it is N-1 because the largest value won't be evaluated as a threshold: with <= threshold going to D1 and > threshold going to D2, D1 would contain all the cases and D2 would be empty. – Calvin Oct 26 '18 at 15:47