
In Weka there is a filter called "ReplaceMissingValues" that replaces all missing values in a dataset using the mean of each attribute. I'd like to replace the missing values of a certain attribute using the mean of the values that belong to a certain class. For example, in a binary dataset I think it is more correct to replace a missing attribute value in a record that belongs to the positive class using the mean calculated only from the records that belong to the positive class. How can this be done? How can we replace values only for records that belong to a certain class?

Rushdi Shams
Titus Pullo

1 Answer


If you replace the missing values of class A with the mean calculated from the training instances of class A only, you are biasing your dataset. To avoid this bias (which will eventually overfit your trained model), it is wiser to use the default "ReplaceMissingValues" filter, i.e., to use the mean (for numeric attributes) and mode (for nominal attributes) of all training instances rather than of just that particular class.
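For anyone who still wants to compare the class-conditional variant from the question against the default filter, the idea can be sketched in a few lines. This is a minimal, language-agnostic illustration in plain Python, not the Weka API: the function name `impute_by_class_mean`, the use of dictionaries for records, and `None` as the missing-value marker are all assumptions made for the example.

```python
from collections import defaultdict


def impute_by_class_mean(rows, attr, label):
    """Return copies of `rows` where a missing `attr` (None) is replaced by
    the mean of `attr` over the rows that share the same `label` value."""
    sums = defaultdict(float)
    counts = defaultdict(int)

    # First pass: accumulate per-class sums over the observed values only.
    for row in rows:
        value = row[attr]
        if value is not None:
            sums[row[label]] += value
            counts[row[label]] += 1
    means = {cls: sums[cls] / counts[cls] for cls in counts}

    # Second pass: fill the gaps. A class with no observed value at all
    # has no mean, so its missing entries are left as None.
    imputed = []
    for row in rows:
        new_row = dict(row)  # copy, so the original records stay untouched
        if new_row[attr] is None and new_row[label] in means:
            new_row[attr] = means[new_row[label]]
        imputed.append(new_row)
    return imputed


# Hypothetical toy data: attribute "x", class attribute "cls".
train = [
    {"x": 1.0, "cls": "pos"},
    {"x": 3.0, "cls": "pos"},
    {"x": None, "cls": "pos"},
    {"x": 10.0, "cls": "neg"},
    {"x": None, "cls": "neg"},
]
filled = impute_by_class_mean(train, "x", "cls")
```

Note that this only works where the class label is known, i.e. on training data. For unseen records the class is exactly what the model has to predict, so the per-class mean cannot be applied to them, which is one practical reason to be cautious with this setup.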

Rushdi Shams
  • I'm working on a medical dataset, so I thought it could be more "realistic" to replace with the mean of the class that the record belongs to. I'll try the replace missing values function too, but I'd also like to be able to try my idea without modifying the original data files (an xls file!) – Titus Pullo Apr 23 '12 at 16:52
  • As I said, there is a high risk of overfitting your trained model, because in real life the unseen data is likely to contain values of a feature X that are not close to the mean of any particular class. If you train your model with this setup, it will only learn that "the values of feature X are somewhat close to the mean of a particular class A", and if that does not hold for unseen data, your model is overfitted. – Rushdi Shams Apr 23 '12 at 18:18
  • I tried your suggestion (using a tree built with J48) and I got worse results than when leaving the missing values in! How is this possible? – Titus Pullo Apr 24 '12 at 09:37
  • I don't know why this happens, but I believe there is plenty of research on the effect of missing values on decision trees. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4063648&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4063648 – Rushdi Shams May 01 '12 at 04:51
  • J48 handles missing values by splitting the samples according to the frequencies of the existing values. When you replaced the missing values before running the tree, you actually overrode J48's own handling of missing values. This *might* be why you got worse results – daramasala Apr 24 '13 at 08:02
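
The splitting "according to the frequencies of the existing values" mentioned in the last comment can be illustrated roughly as follows. This is a simplified sketch for a single nominal split, written in plain Python, and not J48's actual code: the helper names `branch_weights` and `distribute` are hypothetical, and `None` stands for a missing value.

```python
from collections import Counter


def branch_weights(values):
    """Fraction of the observed (non-missing) values that fall into each branch."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    total = len(observed)
    return {branch: n / total for branch, n in counts.items()}


def distribute(instance_weight, value, weights):
    """Send an instance down the split.

    A known value goes entirely to its own branch; a missing value is
    spread over all branches in proportion to the observed frequencies.
    """
    if value is not None:
        return {value: instance_weight}
    return {branch: instance_weight * w for branch, w in weights.items()}


# Hypothetical attribute column at one split: three "yes", one "no", one missing.
values = ["yes", "yes", "yes", "no", None]
weights = branch_weights(values)
parts = distribute(1.0, None, weights)  # fractional copies of the incomplete record
```

Imputing a single value beforehand sends the whole record down one branch, whereas this scheme keeps a fraction of it in every branch, so the two approaches can produce different trees.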