
Imagine we have three classes, A, B, and C, and we classify a document d with a standard MaxEnt classifier, which gives us the following class probabilities:

P(A | d) = 0.50
P(B | d) = 0.25
P(C | d) = 0.25

Intuitively, that feels very different from this set of probabilities:

P(A | d) = 0.50
P(B | d) = 0.49
P(C | d) = 0.01

Is there a way to score the difference between these two?

nobillygreen

2 Answers


The problem you are facing is often referred to as measuring the "consensus" among classifiers. Since a multilabel MaxEnt model can be seen as N independent classifiers, you can think of it as a group of models "voting" for different classes.

Now, there are many ways of measuring such "consensus", including:

  • "naive" calculation of the margin - difference between the "winning" class probability and the second one - bigger the margin - more confident the classification
  • entropy - smaller the entropy of the resulting probability distribution, the more confident the decision
  • some further methods involving KL divergence etc.

In general, you should think about methods of detecting "uniformity" of the resulting distribution (implying a less confident decision) or "spikiness" (indicating a more confident classification).
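
For concreteness, here is a minimal sketch (Python; not part of the original answer) that computes the margin and the entropy for the two distributions from the question. The function names are just illustrative:

    from math import log2

    def margin(probs):
        """Difference between the two largest class probabilities (bigger = more confident)."""
        top, runner_up = sorted(probs, reverse=True)[:2]
        return top - runner_up

    def entropy(probs):
        """Shannon entropy in bits (smaller = spikier, i.e. more confident)."""
        return -sum(p * log2(p) for p in probs if p > 0)

    d1 = [0.50, 0.25, 0.25]  # first set of probabilities from the question
    d2 = [0.50, 0.49, 0.01]  # second set

    print(margin(d1), margin(d2))    # 0.25 vs 0.01  -> d1 wins by a much larger margin
    print(entropy(d1), entropy(d2))  # ~1.50 vs ~1.07 bits -> d2 is actually the spikier distribution

Note that the two measures can disagree, as they do here: the margin says the first distribution is the more confident classification, while the entropy is lower for the second one (most of its mass sits on just two classes), so it is worth deciding which notion of "confidence" you actually care about.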

lejlot
  • +1 for entropy. Fun fact: The reason why Max Ent classifiers are called that way is that they try to maximise the entropy of P(output|input) while respecting the training data. In a way, the classifier tries to find the most unbiased probability distribution that is in line with the training data. – mbatchkarov Dec 09 '13 at 11:54
  • Uniformity is the wrong thing to go for---poor probability models can often provide very spiky posteriors which are entirely incorrect. You need to reference the correct posterior, through cross-entropy (KL Divergence as you suggest) to ensure that your distribution is correct. After all, uniform posteriors might actually be accurate... – Ben Allison Dec 10 '13 at 09:29

What you're looking for is cross-entropy: specifically, you want to calculate the cost of approximating the true distribution with the one output by your classifier. Probabilistic multi-class classifiers will optimise this directly in many cases. Take a look at this.
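
As a rough illustration of what that looks like (Python; the one-hot "true" distribution below is my own assumption for the sake of the example, not something stated in the answer):

    from math import log2

    def cross_entropy(true_dist, predicted_dist):
        """H(p, q) = -sum_i p_i * log2(q_i): the expected number of bits needed to encode
        outcomes drawn from the true distribution p using a code built from q."""
        return -sum(p * log2(q) for p, q in zip(true_dist, predicted_dist) if p > 0)

    # Suppose the true class of d is A, represented as a one-hot distribution (an assumption).
    true_d = [1.0, 0.0, 0.0]

    pred_1 = [0.50, 0.25, 0.25]
    pred_2 = [0.50, 0.49, 0.01]

    print(cross_entropy(true_d, pred_1))  # 1.0 bit
    print(cross_entropy(true_d, pred_2))  # 1.0 bit

With a one-hot reference, cross-entropy reduces to the negative log-probability of the correct class, so these two particular predictions score identically; distinguishing them requires either a non-degenerate reference distribution or one of the confidence measures from the other answer.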

Ben Allison