Weka: ReplaceMissingValues for a test file

Question

I am a bit worried about using Weka's ReplaceMissingValues to impute the missing values in the test ARFF dataset on its own, without reference to the training dataset. Below is the command line:

java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -c last  -i "test_file_with_missing_values.arff" -o "test_file_with_filled_missing_values.arff"

From a previous post (Replace missing values with mean (Weka)), I learned that Weka's ReplaceMissingValues simply replaces each missing value with the mean of the corresponding attribute. This implies that the mean needs to be computed for each attribute. While computing this mean is perfectly fine for the training file, it is not okay for the test file.

This is because in the typical test scenario, we should not assume that we know the mean of each test attribute when imputing missing values. We often have only one test record with multiple attributes to classify, rather than an entire set of test records in a test file. Therefore, we should instead impute each missing value using the mean computed from the training data. The above command would then be incorrect, because it would need another input (the means of the training attributes).

Has anybody thought about this before? How do you work around this using Weka?

Sentry · Answer 1 · 2013-03-16T09:19:45.253

Easy, see Batch Filtering

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;

Instances train = ...   // from somewhere
Instances test = ...    // from somewhere
Standardize filter = new Standardize();  // ReplaceMissingValues can be used the same way
filter.setInputFormat(train);  // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter);  // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter);    // create new test set, reusing the statistics from train

The filter is initialized using the training data and then applied to both the training and the test data. The same pattern works with ReplaceMissingValues in place of Standardize.
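To make the idea concrete without depending on Weka, here is a minimal plain-Java sketch of what "learn from train, apply to test" means for mean imputation. The class and method names (TrainMeanImputer, columnMeans, impute) are hypothetical and not part of Weka, and missing values are assumed to be encoded as NaN. It only covers numeric attributes, whereas Weka's filter also handles nominal attributes by using the mode.

```java
import java.util.Arrays;

// Hypothetical illustration, not Weka code: missing values are encoded as Double.NaN.
public class TrainMeanImputer {

    // Column-wise means over the non-missing (non-NaN) training values.
    public static double[] columnMeans(double[][] train) {
        int cols = train[0].length;
        double[] sums = new double[cols];
        int[] counts = new int[cols];
        for (double[] row : train) {
            for (int j = 0; j < cols; j++) {
                if (!Double.isNaN(row[j])) {
                    sums[j] += row[j];
                    counts[j]++;
                }
            }
        }
        double[] means = new double[cols];
        for (int j = 0; j < cols; j++) {
            // A column with no observed training values has no mean; it stays NaN.
            means[j] = counts[j] == 0 ? Double.NaN : sums[j] / counts[j];
        }
        return means;
    }

    // Returns a copy of data in which each NaN is replaced by the supplied mean of its column.
    public static double[][] impute(double[][] data, double[] means) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i].clone();
            for (int j = 0; j < out[i].length; j++) {
                if (Double.isNaN(out[i][j])) {
                    out[i][j] = means[j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] train = { {1.0, 10.0}, {3.0, Double.NaN}, {Double.NaN, 30.0} };
        double[][] test = { {Double.NaN, 5.0} };   // a single test record

        double[] means = columnMeans(train);       // statistics come from the training data only
        double[][] newTrain = impute(train, means);
        double[][] newTest = impute(test, means);  // the test record never contributes to the means

        System.out.println(Arrays.deepToString(newTrain));
        System.out.println(Arrays.deepToString(newTest));  // prints [[2.0, 5.0]]
    }
}
```

The point of the design is that the means are computed once and then treated as fixed parameters, so a single test record can be imputed even though it has no mean of its own.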

The problem arises when you apply the ReplaceMissingValues filter outside any processing pipeline, because after writing the filtered data you can no longer distinguish between "real" values and "imputed" values. This is why you should do everything that needs to be done in a single pipeline, e.g., using the FilteredClassifier:

java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
  -t "training_file_with_missing_values.arff" \
  -T "test_file_with_missing_values.arff" \
  -F weka.filters.unsupervised.attribute.ReplaceMissingValues \
  -W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a

This example initializes the ReplaceMissingValues filter using the "training_file_with_missing_values.arff" data set and trains a multilayer perceptron on the filtered training data. It then applies the same filter to "test_file_with_missing_values.arff" (using the means learned from the training set) and predicts the class of the filtered test data.

Thanks, a few questions though. That looks like it can work but I am new to Weka and even more so weka on the command line... Wouldn't I also need to run StringToWordVector as well? How can I add a second filter into the same command? Also "Illegal option -c last" gets thrown with that code. Also I am not using MultilayerPerceptron as it is too slow with this many features but I know how to replace that with LibSVM / NaiveBayes — Reily Bourne, Mar 15 '13 at 19:38
And then also how do I actually get it to give me the predicted class of the unknown data? The ones where the label is marked as "?". How can I get that and save it? — Reily Bourne, Mar 15 '13 at 19:50
@JoshWeissbock Maybe you should consider posting a separate question for this. — Sentry, Mar 15 '13 at 20:03
Here: http://stackoverflow.com/questions/15441428/learning-weka-on-the-command-line — Reily Bourne, Mar 15 '13 at 20:21
Btw: I've fixed the command line above, the "-c last" was not correct — Sentry, Mar 16 '13 at 09:20
