How should you test the significance of 2 classification accuracy scores: paired permutation test

Question

I have a single trained classifier tested on 2 related multiclass classification tasks. Because each trial in one task is related to the corresponding trial in the other, the 2 sets of predictions constitute paired data. I would like to run a paired permutation test to find out whether the difference in classification accuracy between the 2 prediction sets is significant.

So my data consists of 2 lists of predicted classes, where each prediction is related to the prediction in the other test set at the same index.

Example:

actual_classes = [1, 3, 6, 1, 22, 1, 11, 12, 9, 2]
predictions1 = [1, 3, 6, 1, 22, 1, 11, 12, 9, 10] # 90% acc.
predictions2 = [1, 3, 7, 10, 22, 1, 7, 12, 2, 10] # 50% acc.

H0: There is no difference in classification accuracy between the 2 prediction sets.

How do I go about running a paired permutation test to test the significance of the difference in classification accuracy?

score 0 · Answer 1 · answered Feb 27 '22 at 10:43

I have been thinking about this and I'm going to post a proposed solution and see if someone approves or explains why I'm wrong.

import random

actual_classes = [1, 3, 6, 1, 22, 1, 11, 12, 9, 2]
predictions1 = [1, 3, 6, 1, 22, 1, 11, 12, 9, 10] # 90% acc.
predictions2 = [1, 3, 7, 10, 22, 1, 7, 12, 2, 10] # 50% acc.
# Pair up the predictions by index: [(1, 1), (3, 3), (6, 7), (1, 10), ...]
paired_predictions = list(zip(predictions1, predictions2))

def accuracy(predictions):
    # Proportion of predictions that equal the actual class at the same index
    return sum(p == a for p, a in zip(predictions, actual_classes)) / len(actual_classes)

actual_test_statistic = accuracy(predictions1) - accuracy(predictions2) # 0.9 - 0.5 = 0.4
number_of_iterations = 10000
all_simulations = [] # simulated accuracy differences
for _ in range(number_of_iterations):
    random.shuffle(paired_predictions) # only shuffle between pairs, not within
    simulated_predictions1 = [pair[0] for pair in paired_predictions]
    simulated_predictions2 = [pair[1] for pair in paired_predictions]
    simulated_accuracy1 = accuracy(simulated_predictions1)
    simulated_accuracy2 = accuracy(simulated_predictions2)
    all_simulations.append(simulated_accuracy1 - simulated_accuracy2) # Put the simulated difference in the list

# Proportion of simulated differences more extreme than the observed one
p = sum(abs(s) > abs(actual_test_statistic) for s in all_simulations) / number_of_iterations
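
For comparison, here is a minimal sketch of the formulation more commonly described as a paired permutation test. It differs from the proposal above: instead of reordering the pairs, it randomly swaps the two predictions within each pair, so every pair stays aligned with its actual class. Swapping a pair simply flips the sign of that trial's correctness difference. The function name paired_permutation_test and the parameters n_iter and seed are illustrative assumptions, not part of the original post.

```python
import random

def paired_permutation_test(actual, preds_a, preds_b, n_iter=10000, seed=0):
    """Two-sided paired permutation test on the accuracy difference.

    Each iteration independently swaps the two predictions within a pair
    with probability 0.5; pairs stay aligned with their actual class.
    """
    rng = random.Random(seed)
    n = len(actual)
    # Per-trial correctness difference: +1, 0, or -1.
    diffs = [(a == y) - (b == y) for y, a, b in zip(actual, preds_a, preds_b)]
    observed_sum = sum(diffs)
    extreme = 0
    for _ in range(n_iter):
        # Swapping a pair flips the sign of its correctness difference.
        simulated_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(simulated_sum) >= abs(observed_sum):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_iter + 1)
    return observed_sum / n, p_value

actual_classes = [1, 3, 6, 1, 22, 1, 11, 12, 9, 2]
predictions1 = [1, 3, 6, 1, 22, 1, 11, 12, 9, 10]
predictions2 = [1, 3, 7, 10, 22, 1, 7, 12, 2, 10]

observed, p_value = paired_permutation_test(actual_classes, predictions1, predictions2)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```

Only the 4 trials where exactly one prediction is correct can change the statistic, so with this example data the estimate should land near 2/16 = 0.125. Pairs where both predictions are right or both are wrong contribute nothing under this scheme.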

If you have any thoughts, let me know in the comments. Or better still, provide your own corrected version in your own answer. Thank you!
