
I am new to WEKA.

In my dataset, I have a numeric attribute in which specific values stand for 'missing value' and 'not applicable'.

For example:

0 - represents a missing value
99999 - represents not applicable

For 'missing value', I can represent it using '?', but what about 'not applicable'?

My questions are:

1) How can we tell WEKA not to include the 'not applicable' value when calculating the mean or standard deviation?
2) How does the 'not applicable' value affect the classification result?

Thank you.

1 Answer


This might actually be a question better suited for stats.stackexchange.com, though I acknowledge that it is WEKA-specific. There might be models in WEKA that handle missing values well. I don't know WEKA, but there might be decision tree implementations that handle this gracefully for you.

However, you might want to work through a couple of more basic considerations first, as missing feature values are a difficult problem. Any automatic functionality in WEKA would have to make the same decisions anyway, so it is probably better to make them beforehand using your domain knowledge.

'Not Applicable' is one of the ways for the feature to be missing. So there may or may not be a distinction between 'missing' and 'not applicable', depending upon your dataset. In calling a value "missing", you are merely saying you do not have the value. Why is it missing?
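To make this concrete, here is a minimal Python sketch (plain standard library, not WEKA code) that recodes the two sentinel codes and computes the mean and standard deviation over the remaining values only. The codes 0 and 99999 come from the question; the sample column and the helper names `recode` and `summarize` are made up for illustration.

```python
import statistics

MISSING_CODE = 0             # from the question: 0 stands for "missing"
NOT_APPLICABLE_CODE = 99999  # from the question: 99999 stands for "not applicable"


def recode(values):
    """Replace both sentinel codes with None so they stop counting as numbers."""
    sentinels = {MISSING_CODE, NOT_APPLICABLE_CODE}
    return [None if v in sentinels else v for v in values]


def summarize(values):
    """Mean and sample standard deviation over the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return statistics.mean(observed), statistics.stdev(observed)


raw = [12, 0, 15, 99999, 18]  # hypothetical column
clean = recode(raw)           # sentinels become None
mean, std = summarize(clean)  # statistics over 12, 15, 18 only
```

When the data is written back out for WEKA, the `None` entries would become '?', the marker the question already uses for missing values. Whether 'not applicable' should be folded into the same marker is exactly the judgment call discussed here.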

There are many potential causes for missingness in a feature, some more detrimental than others. In this situation there are mainly three options:

  1. Delete all records which have a missing value
  2. Remove any feature that has a missing value
  3. Replace any missing value with some "guess" at what the value should be. This is called imputation.
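As a rough illustration of the three options, here is a small Python sketch on a toy table stored as a list of dicts, with `None` marking a missing value. The column names and numbers are hypothetical, and mean imputation is only one simple choice of "guess", not a recommendation.

```python
def delete_records(rows, col):
    """Option 1: keep only the rows where the feature is observed."""
    return [r for r in rows if r[col] is not None]


def remove_feature(rows, col):
    """Option 2: drop the feature from every row."""
    return [{k: v for k, v in r.items() if k != col} for r in rows]


def impute_mean(rows, col):
    """Option 3: replace each missing value with the mean of the observed ones."""
    observed = [r[col] for r in rows if r[col] is not None]
    guess = sum(observed) / len(observed)
    return [{**r, col: guess if r[col] is None else r[col]} for r in rows]


# Hypothetical data: the second record has no income value.
rows = [
    {"age": 30, "income": 40.0},
    {"age": 41, "income": None},
    {"age": 25, "income": 60.0},
]

fewer_rows = delete_records(rows, "income")
fewer_cols = remove_feature(rows, "income")
filled = impute_mean(rows, "income")
```

Option 1 loses a record, option 2 loses a column, and option 3 keeps both at the price of inventing a value.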

The most conservative and safest choice is simply to drop the feature. If you do this, it would be useful to create an extra indicator feature that records whether or not the original feature was missing. This information might be useful in fitting a good model.
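A minimal sketch of that indicator idea in Python, assuming rows stored as dicts with `None` for a missing value. The helper name `add_missing_indicator`, the `_was_missing` suffix, and the sample data are all hypothetical.

```python
def add_missing_indicator(rows, col):
    """Drop `col` and add `<col>_was_missing` (1 if it was missing, else 0)."""
    out = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != col}
        new_row[col + "_was_missing"] = 1 if r[col] is None else 0
        out.append(new_row)
    return out


# Hypothetical data: the second record has no income value.
rows = [
    {"age": 30, "income": 40.0},
    {"age": 41, "income": None},
]
flagged = add_missing_indicator(rows, "income")
```

The model never sees the original values, but it can still learn from the pattern of which records lacked one.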

In choosing which one of these three approaches to take, there are a couple of things to consider.

  • Do you know for sure that 99999 is generated from an explicit NA decision, and not by the same mechanism as 0? By what mechanism are the zeros generated, since you merely describe them as "missing"?
  • How common are the feature values that indicate a missing value? The more missing feature values there are, the riskier case deletion or feature imputation becomes.
  • If you believe there is value in imputation, can your domain knowledge help you choose suitable values? For instance, if a value is entered only when it deviates from some expected level (let's say high blood pressure), and left blank when it lies at the expected level, imputing the expected level in the missing cases would be reasonable.
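The second and third considerations can be checked with a few lines of Python. This is only a sketch: the codes 0 and 99999 are taken from the question, while the sample readings, the expected level of 120, and the helper names are assumptions made for illustration.

```python
from collections import Counter


def sentinel_rates(values, codes):
    """Fraction of the column taken up by each sentinel code."""
    counts = Counter(values)
    total = len(values)
    return {name: counts[code] / total for name, code in codes.items()}


def impute_expected(values, expected):
    """Fill missing entries (None) with a domain-supplied expected level."""
    return [expected if v is None else v for v in values]


codes = {"missing": 0, "not_applicable": 99999}       # codes from the question
readings = [0, 120, 99999, 0, 135, 150, 99999, 0]      # hypothetical column
rates = sentinel_rates(readings, codes)

# Hypothetical blood-pressure case: blank means "at the expected level".
filled = impute_expected([None, 150, None], expected=120)
```

If the rates turn out to be large, deleting cases or imputing values changes a big share of the data, which is the risk the second point warns about.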
Alex A.