Correctly labeling predicted classes when using the Java WEKA library

Question

I have a program that trains an algorithm with a 2-class categorical outcome, then runs and writes out predictions (probabilities of each of the 2 classes) for an unlabeled data set.

All data sets run against this program will have the same 2 classes as the outcome. With this in mind I ran the predictions and used a little post-hoc statistics to figure out which column of results described which outcome, and proceeded to hard code them:

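import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import weka.classifiers.Classifier;
import weka.core.Instances;

public class runPredictions {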
public static void runPredictions(ArrayList al2) throws IOException, Exception{
    // Retrieve objects
    Instances newTest = (Instances) al2.get(0);
    Classifier clf = (Classifier) al2.get(1);

    // Print status
    System.out.println("Generating predictions...");

    // create copy
    Instances labeled = new Instances(newTest);

    BufferedWriter outFile = new BufferedWriter(new FileWriter("silverbullet_rro_output.csv"));
    StringBuilder builder = new StringBuilder();

    builder.append("Prob_Retain"+","+"Prob_Attrite"+"\n");
    for (int i = 0; i < labeled.size(); i++)      
    {
        double[] clsLabel = clf.distributionForInstance(newTest.instance(i));
        for(int j=0;j<2;j++){
           builder.append(clsLabel[j]+""); 
           if(j < clsLabel.length - 1)
               builder.append(",");
        }
        builder.append("\n");
    }
    outFile.write(builder.toString());//save the string representation
    System.out.println("Output file written.");
    System.out.println("Completed successfully!");
    outFile.close();    
}    
}

The problem is that the mapping between the 2 columns and the 2 outcome categories turns out not to be fixed. It seems to depend on which category appears first in the training data set, which is entirely arbitrary. So when other data sets were run through this program, the hard-coded labels were backwards.
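To make the ordering issue concrete, here is a minimal sketch in plain Java with no WEKA dependency. It assumes, as suspected above, that class labels are indexed in order of first appearance in the training data. The class `LabelOrderDemo` and the method `labelsInOrderOfAppearance` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration, not WEKA code: models a loader that assigns
// class indices in order of first appearance (an assumption).
class LabelOrderDemo {
    static List<String> labelsInOrderOfAppearance(List<String> outcomeColumn) {
        // LinkedHashSet drops duplicates but keeps insertion order.
        return new ArrayList<>(new LinkedHashSet<>(outcomeColumn));
    }

    public static void main(String[] args) {
        List<String> first = labelsInOrderOfAppearance(
                Arrays.asList("Retain", "Attrite", "Retain"));
        List<String> second = labelsInOrderOfAppearance(
                Arrays.asList("Attrite", "Retain", "Retain"));
        System.out.println(first);  // index 0 is Retain
        System.out.println(second); // index 0 is Attrite
    }
}
```

Under that assumption, the same two categories end up at swapped indices, so a column header fixed to index 0 is only right for some data sets.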

So I need a better way to label them, but after looking at the documentation for Classifier and distributionForInstance I'm not seeing anything useful.

Update:

I figured out how to print it to the screen (thanks to this), but still had trouble with writing it to csv:

for (int i = 0; i < labeled.size(); i++)      
    {
        // Discrete prediction
        double predictionIndex = 
            clf.classifyInstance(newTest.instance(i)); 

        // Get the predicted class label from the predictionIndex.
        String predictedClassLabel =
            newTest.classAttribute().value((int) predictionIndex);

        // Get the prediction probability distribution.
        double[] predictionDistribution = 
            clf.distributionForInstance(newTest.instance(i)); 

        // Print out the true predicted label, and the distribution
        System.out.printf("%5d: predicted=%-10s, distribution=", 
                          i, predictedClassLabel); 

        // Loop over all the prediction labels in the distribution.
        for (int predictionDistributionIndex = 0; 
             predictionDistributionIndex < predictionDistribution.length; 
             predictionDistributionIndex++)
        {
            // Get this distribution index's class label.
            String predictionDistributionIndexAsClassLabel = 
                newTest.classAttribute().value(
                    predictionDistributionIndex);

            // Get the probability.
            double predictionProbability = 
                predictionDistribution[predictionDistributionIndex];

            System.out.printf("[%10s : %6.3f]", 
                              predictionDistributionIndexAsClassLabel, 
                              predictionProbability );

            // Attempt to write to CSV
            builder.append(i+","+predictedClassLabel+","+
                    predictionDistributionIndexAsClassLabel+","+predictionProbability);
                            //.charAt(0)+','+predictionProbability.charAt(0));

        }

        System.out.printf("\n");
        builder.append("\n");
    }

score 1 · Accepted Answer · edited May 23 '17 at 10:29

I adapted the code below from this answer and this answer. Basically, you can query the test data for the class attribute, then obtain the specific value for each possible class.

for (int i = 0; i < labeled.size(); i++)
{
    // Discrete prediction: index of the most probable class
    double predictionIndex =
        clf.classifyInstance(newTest.instance(i));

    // Get the predicted class label from the predictionIndex.
    String predictedClassLabel =
        newTest.classAttribute().value((int) predictionIndex);

    // Get the prediction probability distribution.
    double[] predictionDistribution =
        clf.distributionForInstance(newTest.instance(i));

    // Print out the predicted label, and the distribution
    System.out.printf("%5d: predicted=%-10s, distribution=",
                      i, predictedClassLabel);

    // Loop over all the class labels in the distribution.
    for (int predictionDistributionIndex = 0;
         predictionDistributionIndex < predictionDistribution.length;
         predictionDistributionIndex++)
    {
        // Get this distribution index's class label.
        String predictionDistributionIndexAsClassLabel =
            newTest.classAttribute().value(
                predictionDistributionIndex);

        // Get the probability.
        double predictionProbability =
            predictionDistribution[predictionDistributionIndex];

        System.out.printf("[%10s : %6.3f]",
                          predictionDistributionIndexAsClassLabel,
                          predictionProbability );

        // Write one CSV row per (instance, class label) pair so that
        // consecutive entries do not run together on the same line
        builder.append(i+","+
                predictionDistributionIndexAsClassLabel+","+predictionProbability+"\n");
    }

    System.out.printf("\n");
}

// Save results in .csv file
outFile.write(builder.toString());//save the string representation
outFile.close(); // flush the buffered output and release the file
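To get back the layout the question started with (one row per instance, one probability column per class), the header can be built from the same label lookup instead of being hard-coded. The sketch below is plain Java with no WEKA dependency, so treat it as an illustration only: the `labels` list stands in for the values returned by `newTest.classAttribute().value(j)`, each `double[]` stands in for a `distributionForInstance` result, and `WideCsv`, `toCsv` and the `Prob_` prefix are illustrative names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a wide CSV whose header follows the label order.
class WideCsv {
    static String toCsv(List<String> labels, List<double[]> distributions) {
        StringBuilder sb = new StringBuilder();
        // Header: one column per class label, in index order.
        for (int j = 0; j < labels.size(); j++) {
            if (j > 0) sb.append(',');
            sb.append("Prob_").append(labels.get(j));
        }
        sb.append('\n');
        // Body: one row per instance, probabilities in the same index order.
        for (double[] dist : distributions) {
            for (int j = 0; j < dist.length; j++) {
                if (j > 0) sb.append(',');
                sb.append(dist[j]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("Attrite", "Retain");
        List<double[]> dists = Arrays.asList(
                new double[] {0.25, 0.75},
                new double[] {0.5, 0.5});
        System.out.print(toCsv(labels, dists));
    }
}
```

Because the header and the rows are both driven by the same index `j`, the column names stay correct whichever category ends up at index 0.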

You are absolutely right, that i should be a different index! It is simply the instance you are evaluating. I will correct — Walter, Jan 31 '17 at 15:45
Thanks again. So the for loop for the first line should be something like `for (int j = 0; j < newTest.size(); j++)` right? And the `i` will always be 0 or 1 for the 2-category (2 label) case, but on the line where you have `newTest.classAttribute().value(i)` do we not need a `j` somewhere to access the right part of `newTest`? — Hack-R, Jan 31 '17 at 15:50
I think there may be a slight mistake or missing component (see comment above) but I got this working just now in my version of it and you deserve the credit, so I am about to edit your post with my version and mark it as the solution. If you want to roll back the edit I am about to make and just tweak your version that is totally cool. Thanks again for your help, I greatly appreciate it! — Hack-R, Jan 31 '17 at 15:55
