
Having read *How Not To Sort By Average Rating*, I thought I should give it a try.

-- Lower bound of the Wilson score 95% confidence interval.
-- Constants: z = 1.96, so z^2 = 3.8416, z^2/2 = 1.9208, z^2/4 = 0.9604.
CREATE FUNCTION `mydb`.`LowerBoundWilson95` (pos FLOAT, neg FLOAT)
RETURNS FLOAT DETERMINISTIC
RETURN
IF(
    pos + neg <= 0,
    0,
    (
        (pos + 1.9208) / (pos + neg)
        -
        1.96 * SQRT(
            (pos * neg) / (pos + neg) + 0.9604
        )
        / (pos + neg)
    )
    /
    (
        1 + 3.8416
        / (pos + neg)
    )
);
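For reference, the same bound ports directly to Python (the function name here is mine, not from the article). Note that with pos = 0 the two terms of the numerator cancel exactly, so the score is 0 up to floating-point error:

```python
from math import sqrt

def lower_bound_wilson_95(pos, neg):
    """Lower bound of the Wilson score 95% confidence interval.

    Same formula as the SQL function above, with z = 1.96:
    z^2 = 3.8416, z^2/2 = 1.9208, z^2/4 = 0.9604.
    """
    n = pos + neg
    if n <= 0:
        return 0.0
    z = 1.96
    phat = pos / n  # observed fraction of positive ratings
    return ((phat + z * z / (2 * n)
             - z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)))
            / (1 + z * z / n))
```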

Running some tests, I discovered that objects with pos=0 and neg>0 have very small but nonzero scores, whereas an object with pos=neg=0 has a score of exactly zero, ranking it lower.

I am of the opinion that an unrated object should be listed above one which has no positive ratings but some negatives.

I reasoned that "the individual ratings are all really expressions of deviation from some baseline, so I'll move the baseline and give every object a 'neutral' initial score," and came up with this:

CREATE FUNCTION `mydb`.`AdjustedRating` (pos FLOAT, neg FLOAT)
RETURNS FLOAT DETERMINISTIC
RETURN `mydb`.`LowerBoundWilson95` (pos + 4, neg + 4);

Here are some sample outputs for AdjustedRating (rows: pos, columns: neg):

pos\neg |   0   |   1   |   2
--------+-------+-------+-------
   0    | 0.215 | 0.188 | 0.168
   1    | 0.266 | 0.235 | 0.212
   2    | 0.312 | 0.280 | 0.254
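The sample values can be reproduced with a direct Python translation (a sketch, not the exact MySQL execution; function names are mine):

```python
from math import sqrt

def wilson_lower_bound(pos, neg, z=1.96):
    # Lower bound of the Wilson score interval, as in LowerBoundWilson95.
    n = pos + neg
    if n <= 0:
        return 0.0
    phat = pos / n
    return ((phat + z * z / (2 * n)
             - z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)))
            / (1 + z * z / n))

def adjusted_rating(pos, neg):
    # Credit every object 4 "free" upvotes and 4 "free" downvotes.
    return wilson_lower_bound(pos + 4, neg + 4)
```

For example, `round(adjusted_rating(0, 0), 3)` gives 0.215, the score of an unrated object.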

This is closer to the sort of scores I want, and as a numerical hack I guess it's workable, but I can't mathematically justify it.

Is there a better way, a "right" way?

spraff
  • Could this be partly a float/rounding issue? I have implemented this algorithm in Python, and for `pos=0` and `neg>0` I get `0.0`, i.e. unrated and downrated items are equal. Still not what you want; your offset seems to work well. – Anentropic Jan 22 '14 at 17:42

1 Answer


The problem arises because this approximation (lower confidence bound) is really meant for identifying the highest rated items of a list. If you were interested in the lowest ranked, you could take the upper confidence bound instead.

Alternatively, we can use Bayesian statistics, which is the formalization of exactly the second method you describe. Evan Miller actually had a follow-up post to this in which he said:

The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we'd like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data.

Bayesian statistics lets us formalize this intuition...

Using the Bayesian ranking approach, any point that has zero data falls back to the prior mean (what you refer to as the initial score) and then moves away from it as it collects data. This is also the approach IMDB uses to compute its Top Movies list: https://math.stackexchange.com/questions/169032/understanding-the-imdb-weighted-rating-function-for-usage-on-my-own-website

The specific method you suggest of crediting each object 4 upvotes and 4 downvotes is equivalent to putting a mean of 0.5 with a weight of 8 votes. Given an absence of any other data, this is a reasonable start. Laplace famously argued in the sunrise problem that events should be credited with 1 success and 1 failure. In the item ranking problem, we have a lot more knowledge, so it makes sense to set the prior mean equal to the average rating. The weight of this prior mean (or how fast you move off it as a function of data, also called the prior variance) can be challenging to set.
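The pseudo-vote idea corresponds to a Beta prior, whose posterior mean is trivial to compute. A minimal sketch (parameter names are mine), with the defaults matching the +4/+4 scheme from the question, i.e. prior mean 0.5 with a weight of 8 votes:

```python
def bayesian_score(pos, neg, prior_mean=0.5, prior_weight=8.0):
    """Posterior mean of a Beta-Binomial model.

    prior_mean * prior_weight plays the role of "free" upvotes
    (here 4 up and 4 down, matching AdjustedRating's offset).
    """
    return (pos + prior_mean * prior_weight) / (pos + neg + prior_weight)
```

With no data this returns the prior mean (0.5); as votes accumulate it converges to pos / (pos + neg).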

For its ranking of the Top 250 Movies, IMDB uses a mean movie rating of 7.1 with a weight of 25000 votes, which is equivalent to treating every movie as if it started with 25000 "free" votes at a rating of 7.1.
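In code, that IMDB-style weighting is just a convex combination of the item's own mean and the global mean (a sketch; the defaults for C and m are the figures quoted above):

```python
def imdb_weighted_rating(R, v, C=7.1, m=25000):
    """Weighted rating: an item's own average R over v votes, blended
    with the global mean C as if it started with m votes at rating C."""
    return (v * R + m * C) / (v + m)
```

An unrated movie (v = 0) scores exactly C and sits in the middle of the list; as v grows, the score approaches the movie's own average R.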

bscan
  • Thanks, that's very interesting, but if you want a lower bound for the highest-rated items and an upper bound for the lowest-rated items, how do you have a single formula to rank *all* items? What if all ratings for the subset of interest are negative or nonexistent? I would want the same formula to work in all situations. – spraff May 16 '15 at 13:52
  • The Bayesian ranking approach replaces both the upper and lower confidence bounds by using an average that falls back to the global average when there is little data. For example, IMDB uses the same formula to do their Top Movies as well as the list of the worst movies of all time. If no rankings are available, the item simply gets the global average and thus would be ranked in the middle of all items. – bscan May 17 '15 at 21:02