How do classifiers (such as decision trees) in Weka interpret '?' (which stands for a missing value in ARFF files) during the learning stage? Will Weka just replace it with some predefined value (e.g. '0' or 'false'), or will it somehow affect the training process?
1 Answer
One option is to treat a missing value as an attribute value in its own right. The J48 classifier does something different: any split on an attribute with missing values is done with weights proportional to the frequencies of the observed non-missing values. This is documented in Witten and Frank's textbook, Data Mining: Practical Machine Learning Tools and Techniques (2nd ed., 2005, p. 63 and p. 191), where the authors note that
eventually, the various parts of the instance will each reach a leaf node, and the decisions at these leaf nodes must be recombined using the weights that have percolated to the leaves.
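To make the weighting concrete, here is a minimal Python sketch of the idea for a single split. It is not Weka's J48 code: the toy data, the attribute name `outlook`, and the helper functions are all hypothetical, and it only illustrates how an instance with a missing value can be split across branches and how the leaf decisions can be recombined.

```python
from collections import Counter

# Hypothetical toy data: (outlook, play) pairs.
# None marks a missing value ('?' in an ARFF file).
train = [
    ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("rainy", "yes"),
    (None, "yes"),
]

def branch_weights(rows):
    """Fraction of observed (non-missing) values that fall into each branch."""
    counts = Counter(value for value, _ in rows if value is not None)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def leaf_distributions(rows, weights):
    """Weighted class counts per branch.

    A row with a missing value is split across all branches,
    each part carrying that branch's weight.
    """
    leaves = {value: Counter() for value in weights}
    for value, label in rows:
        if value is None:
            for branch, w in weights.items():
                leaves[branch][label] += w
        else:
            leaves[value][label] += 1.0
    return leaves

def predict(value, leaves, weights):
    """Class distribution for one instance.

    A missing value sends a weighted part of the instance down every
    branch; the leaf distributions are then recombined with those weights.
    """
    parts = weights if value is None else {value: 1.0}
    combined = Counter()
    for branch, w in parts.items():
        total = sum(leaves[branch].values())
        for label, count in leaves[branch].items():
            combined[label] += w * count / total
    return dict(combined)

weights = branch_weights(train)
leaves = leaf_distributions(train, weights)
dist = predict(None, leaves, weights)
print(weights)
print(dist)
```

In this sketch the missing value is never replaced by a fixed constant such as '0' or 'false'; it only changes how much each branch contributes, both when building the leaf counts and when classifying.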
More information about handling missing values in decision trees, such as the surrogate splits used in CART (which C4.5 and J48, its implementation in Weka, do not use), can be found in the wiki section on Classification Trees. The use of imputation is also discussed in several articles, e.g. Handling missing data in trees: surrogate splits or statistical imputation.
Thanks, that's exactly what I wanted to know. – om-nom-nom May 18 '11 at 05:59
So what is the exact answer for this? – London guy Jul 20 '12 at 19:25
@AbhishekShivkumar The 2nd blind downvote I received today doesn't let me see how my answer could be improved. Of course, I realize that this doesn't help much in answering your question :-) – chl Jul 20 '12 at 19:53