I am currently writing a thesis about a problem in which my algorithm matches nodes between two sets. I am having difficulties defining the proper formal description for the following situation:

  • Set A has c nodes,
  • Set B also has c nodes,
  • Each node in A or B has exactly one correct match in the other set. (1:1)

The matching algorithm is finding node pairs between those two sets. After running the algorithm, there are r nodes that have been correctly matched.

I think that the accuracy score can be calculated as follows:

P=c # there are c correct matches, because the matching is 1:1
N=c*(c-1) # every node of one set (c) paired with every node of the other set except its correct match (c-1)

TP=r # True positive: Equal nodes that are correctly identified
FP=c-r # False positives: Unequal nodes that are incorrectly identified
TN=N-FP # True negatives: Unequal nodes that are correctly identified
FN=P-TP # False negatives: Equal nodes that are incorrectly identified

Sensitivity = TP/P
Specificity = TN/N
Accuracy = (TP+TN)/(TP+TN+FP+FN)=(TP+TN)/(P+N)

One example:

c=10000 # nodes in Set A or B
r=6000 # correctly matched nodes

According to the above formulas:

TP=6000
FP=4000
TN=99986000
FN=4000

And this results in:

Sensitivity = 0.6
Specificity = 0.9999599959996
Accuracy = 0.99992
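
To make the arithmetic easy to re-check, here is a minimal Python sketch of the formulas above. The function name `matching_metrics` and the returned dictionary keys are made up for illustration, and it assumes, as above, that every node receives exactly one proposed match, so that FP = c - r.

```python
def matching_metrics(c: int, r: int) -> dict:
    """Confusion-matrix counts and scores for a 1:1 matching.

    c: number of nodes in each set
    r: number of correctly matched nodes
    Assumption: every node gets exactly one proposed match, so FP = c - r.
    """
    if c < 2 or not 0 <= r <= c:
        raise ValueError("need c >= 2 and 0 <= r <= c")

    P = c            # correct pairs (one per node)
    N = c * (c - 1)  # all remaining node pairs

    TP = r           # correct pairs that were found
    FP = c - r       # wrong pairs that were proposed
    TN = N - FP      # wrong pairs that were not proposed
    FN = P - TP      # correct pairs that were missed

    return {
        "TP": TP,
        "FP": FP,
        "TN": TN,
        "FN": FN,
        "sensitivity": TP / P,
        "specificity": TN / N,
        "accuracy": (TP + TN) / (P + N),
    }


if __name__ == "__main__":
    # The example from the question: c = 10000, r = 6000
    for name, value in matching_metrics(10000, 6000).items():
        print(f"{name} = {value}")
```

Calling it with other values of r for the same c shows how little the accuracy moves compared with the sensitivity.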

If these calculations are correct, it would mean that, even though I only matched 60% of the nodes correctly, the accuracy is higher than 99%. Isn't sensitivity the better indicator of the algorithm's matching ability? Or maybe I just can't use this kind of binary classification for this problem?
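
Substituting the definitions above into the accuracy formula suggests why the value is so high. This is only a sketch under the same assumptions (P = c, N = c(c-1), FP = c - r):

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{P + N}
  = \frac{r + c(c-1) - (c - r)}{c + c(c-1)} \\
 &= \frac{c^2 - 2(c - r)}{c^2}
  = 1 - \frac{2(c - r)}{c^2}
\end{align*}
```

For c = 10000 and r = 6000 this gives 1 - 8000/10^8 = 0.99992, as above. Under these definitions, even r = 0 would still give 1 - 2/c = 0.9998, so the accuracy is dominated by the c(c-1) negative pairs and barely reflects r.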

  • Your mistake is thinking that there is a single thing called "accuracy". Its definition is highly dependent on what the goal of the problem is. That said, in this problem I think you're massively overweighting true negatives. Your goal is to find matches, so I'd probably only care about TP - cFP, for some constant c to be determined. – Frank Yellin Sep 12 '22 at 22:41
