
I'm taking a machine learning class and we have been given our first statistics/programming exercise.

So the exercise goes like this:

Recall the story from the lecture: “Two sellers at Amazon have the same price. One has 90 positive and 10 negative reviews. The other one 2 positive and 0 negative. Who should you buy from?” Write down the posterior probabilities about the reliability (as in the lecture). Calculate p(x > y|D1, D2) using numerical integration. You can generate Beta-distributed samples with the function scipy.stats.beta.rvs(a, b, size).

What we know from the lecture is the following:

We applied two Beta-Binomial models: p(x|D1) = Beta(x|91, 11) and p(y|D2) = Beta(y|3, 1)

Compute probability that seller 1 is more reliable than seller 2:

p(x > y | D1, D2 ) = ∫∫ [x > y] Beta (x| 91, 11) Beta (y| 3, 1) dx dy

So my attempts in Python are like that:

In [1]: import numpy as np
        from scipy import integrate, stats
In [2]: f = lambda x, y: stats.beta.rvs(91, 11, x) * stats.beta.rvs(3, 1, y)
In [3]: stats.probplot(result, x > y)

And I receive an error that states:

... The maximum number of subdivisions (50) has been achieved....

but ultimately there is an answer to the calculation that is approx. 1.7. (We are told that the answer is approx. 0.7.)

My question is: How do I calculate the [x > y] part, meaning: probability that seller 1 (x) is more reliable than seller 2 (y) ?

Liat Khitman
    Please review & edit accordingly what appears as a hyperlink in the rendered text of your question (SO does not support Latex). Plus, question has nothing to do with `machine-learning` - kindly do not spam the tag (removed). – desertnaut Oct 25 '18 at 20:55
  • Can you please revise your Python attempt? `result`, `x` and `y` are not defined as far as I can tell? and you never use `f` – juanpa.arrivillaga Oct 25 '18 at 20:56
  • I edited the HTML part. The ML-tag was added because the exercise was given in my ML-class, but I understand that it's not directly connected. Thanks for the tip :) – Liat Khitman Oct 27 '18 at 11:00

1 Answer


Almost right; I'd do something like:

from scipy import stats

N = 10000
P = sum(stats.beta.rvs(3, 1, size=N) < stats.beta.rvs(91, 11, size=N))
P / N

and if you want a graphical display:

import matplotlib.pyplot as plt
import numpy as np

X = np.linspace(0.6, 0.8, 501)
Y = stats.beta.pdf(X, 1 + P, 1 + N - P)  # posterior Beta(1 + P, 1 + N - P) for the Monte Carlo proportion

plt.plot(X, Y)

there might be library code to do the plotting more nicely.
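For reproducibility, the Monte Carlo block above can also be sketched with a fixed seed (the seed value and sample size here are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, assumed just for reproducibility
M = 100_000
x = stats.beta.rvs(91, 11, size=M, random_state=rng)  # seller 1 reliability samples
y = stats.beta.rvs(3, 1, size=M, random_state=rng)    # seller 2 reliability samples
p_hat = np.mean(x > y)  # Monte Carlo estimate of p(x > y | D1, D2)
```

With this many samples the estimate should land close to 0.71.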

The above gives a Monte Carlo estimate of the answer. If you want a better numerical estimate, you can use quadrature:

from scipy.integrate import dblquad
from scipy import stats

a = stats.beta(91, 11)
b = stats.beta(3, 1)

# dblquad's integrand takes (inner, outer): the first lambda argument is the
# inner variable. The outer variable runs over [0, 1] and the inner one from
# the outer value up to 1, which is exactly the region x > y.
dblquad(
    lambda x, y: a.pdf(x) * b.pdf(y),
    0, 1, lambda x: x, lambda x: 1)

which gives me an estimate of ~0.712592804 (with an error of 2e-8).

If you want even more accuracy, you'd want to do something analytic:

https://stats.stackexchange.com/questions/7061/binomial-probability-question

which involves the use of some transcendentals I struggle to get my head around.
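For this particular pair of posteriors there is a shortcut (a sketch of the special case, not the general formula from the link): Beta(3, 1) has CDF F(t) = t³ on [0, 1], so the double integral collapses to p(x > y) = E[X³] with X ~ Beta(91, 11), and Beta moments have a simple closed form:

```python
from math import prod

# For X ~ Beta(a, b): E[X**k] = prod((a + i) / (a + b + i) for i in range(k)).
# Since Y ~ Beta(3, 1) has CDF t**3, p(x > y) = E[F_Y(X)] = E[X**3].
a, b = 91, 11
p_exact = prod((a + i) / (a + b + i) for i in range(3))
print(p_exact)  # ≈ 0.7126, matching the quadrature result
```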

Sam Mason
  • Thanks a bunch!! Just a quick question though- where exactly am I integration here? hehe – Liat Khitman Oct 27 '18 at 11:02
  • don't think I really read your question; I was just fixing your use of generating random variates to get an answer. if you want a more accurate answer (the second block of code above just displays how accurate the monte-carlo estimate is) then you probably want to be using `dblquad` from `scipy.integrate` and `pdf` to get the probability density from your betas – Sam Mason Oct 28 '18 at 09:31
  • so I'm completely lost. How do I combine the `dblquad `and the already calculated P that you suggested? `def f(x, y): return P * stats.beta.rvs(3, 1, x) * stats.beta.rvs(91, 11, y)` and then `integrate.dblquad(f, 0, 1, lambda x: 0, lambda x: 1)` is obviously wrong. – Liat Khitman Oct 28 '18 at 15:49