How do classifiers in Weka (such as decision trees) interpret '?', which stands for a missing value in ARFF files, during the learning stage? Does Weka simply replace it with some predefined value (e.g. '0' or 'false'), or does it affect the training process in some other way?

michaeltwofish
om-nom-nom

1 Answer

Apart from the option of treating a missing value as an attribute value in its own right, the J48 classifier handles any split on an attribute with missing values by weighting: an instance whose value is missing is notionally split into pieces, one per branch, with weights proportional to the frequencies of the observed non-missing values. This is documented in Witten and Frank's textbook, Data Mining: Practical Machine Learning Tools and Techniques (2nd ed., 2005, p. 63 and p. 191), where the authors write that

eventually, the various parts of the instance will each reach a leaf node, and the decisions at these leaf nodes must be recombined using the weights that have percolated to the leaves.
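The recombination step described in that quote can be sketched in a few lines of Python. This is a simplified illustration of the weighting idea, not Weka's actual J48 code: the tree layout, the `classify` function, and the branch counts are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def classify(node, instance, weight=1.0):
    """Return {class: weighted probability} for one instance.

    Hypothetical tree format (an assumption for this sketch):
      leaf node:     {"leaf": {class: probability}}
      internal node: {"attr": name,
                      "counts": {value: number of training instances},
                      "children": {value: subtree}}
    """
    if "leaf" in node:
        # Scale the leaf's class distribution by the weight that reached it.
        return {c: weight * p for c, p in node["leaf"].items()}

    value = instance.get(node["attr"])  # None plays the role of '?'
    if value is not None:
        # Known value: the whole instance follows a single branch.
        return classify(node["children"][value], instance, weight)

    # Missing value: send a fraction of the instance down every branch,
    # proportional to how many training instances took that branch.
    total = sum(node["counts"].values())
    combined = Counter()
    for v, child in node["children"].items():
        share = weight * node["counts"][v] / total
        for c, p in classify(child, instance, share).items():
            combined[c] += p
    return dict(combined)


# Made-up one-level tree: 5, 4 and 5 training instances per branch.
tree = {
    "attr": "outlook",
    "counts": {"sunny": 5, "overcast": 4, "rainy": 5},
    "children": {
        "sunny": {"leaf": {"yes": 0.4, "no": 0.6}},
        "overcast": {"leaf": {"yes": 1.0, "no": 0.0}},
        "rainy": {"leaf": {"yes": 0.6, "no": 0.4}},
    },
}

known = classify(tree, {"outlook": "sunny"})
missing = classify(tree, {"outlook": None})
```

With these made-up numbers, the instance with a missing `outlook` is split 5/14, 4/14 and 5/14 across the three branches, so the combined estimate for `yes` is (5 × 0.4 + 4 × 1.0 + 5 × 0.6) / 14 = 9/14. No fixed value such as '0' or 'false' is ever substituted for the '?'.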

More information on handling missing values in decision trees, such as the surrogate splits used in CART (an approach that C4.5 and its Weka implementation, J48, do not use), can be found in the wiki section for Classification Trees. The use of imputation is also discussed in several articles, e.g. Handling missing data in trees: surrogate splits or statistical imputation.

Matt S.
chl