
I want to use P10, P50 and P90 values as input to: A) generate a probability density function (this feels like a Myerson distribution, but I can't figure out how to do that in Python; there's an Excel add-in, SIPMath, which does exactly that) B) run a simulation (Monte Carlo?) on the PDF

Example: I want to make a simulation of how long it would take to run from A to B.

P10 = 1 hour
P50 = 1.5 hours
P90 = 2.5 hours

Meaning: in 10% of the attempts I run from A to B in 1 hour or less, in 50% of the attempts in 1.5 hours or less (i.e. 1.5 is the median), and in 10% of the attempts I spend more than 2.5 hours.

Thank you

flash

3 Answers


Assuming that it's appropriate to model this system with a Myerson distribution then, according to Frontline Solvers, "[i]f the specified percentiles are equidistant (measured by the parameter b’ below), then the Myerson distribution is equivalent to a Normal distribution." You're in luck with a simple case.

Of course this cannot be quite true because the normal has infinite tails. You would need to draw samples from a normal population that is truncated on the left.

The (untruncated) normal distribution you need has a mean of 1.5 hours and puts 40% of its mass between 1 hour and that mean of 1.5 hours. The standard normal puts 40% of its mass between -1.2815515655446004 and 0. Then, given a supply of standard normal random deviates z, we could convert them to (untruncated) deviates of the kind needed by computing 1.5 + 0.5*z/1.28155, where 0.5 is the 'distance' between 1 hour and 1.5 hours, and 1.28155 is the corresponding 'distance' for the standard normal.
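That shift-and-rescale can be written directly (a minimal sketch, assuming numpy and scipy are available; the seed is arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)    # standard normal deviates
scale = 0.5 / norm.ppf(0.9)      # 0.5 hours corresponds to 1.28155... standard units
samples = 1.5 + scale * z        # shift to the 1.5-hour median, rescale the spread
```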

Being a normal distribution, it's possible that some deviates less than zero will be generated. However, using the scipy library I find that,

>>> norm.cdf(0, loc=1.5, scale=0.5/1.28)
6.151715518325519e-05

I would say that this is so unlikely that it's not worth the bother to treat this as a truncated normal.
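That said, if negative times must be ruled out entirely, scipy's truncnorm can sample from a left-truncated normal directly; a minimal sketch (the seed is arbitrary):

```python
from scipy.stats import truncnorm

loc, scale = 1.5, 0.5 / 1.28155
# truncnorm takes its truncation points in standard-deviation units
a = (0 - loc) / scale        # truncate at zero hours
b = float("inf")             # no upper truncation
sample = truncnorm.rvs(a, b, loc=loc, scale=scale, size=100, random_state=0)
```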

Therefore, to obtain a sample of Myerson deviates as defined in your question you could do this.

>>> from scipy.stats import norm
>>> sample = norm.rvs(loc=1.5, scale=0.5/1.28, size=100)

The values for loc and scale are as we've discussed. The value for size would be whatever sample size you require.
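As a sanity check (a sketch, assuming numpy is available alongside scipy), the empirical percentiles of a large sample can be compared against the inputs. Because the normal is symmetric, it matches P10 and P50 but puts P90 near 2.0 hours rather than the question's 2.5, which is what the b' != 1 follow-up in the comments is about:

```python
import numpy as np
from scipy.stats import norm

sample = norm.rvs(loc=1.5, scale=0.5 / 1.28155, size=100_000, random_state=0)
p10, p50, p90 = np.percentile(sample, [10, 50, 90])
# p10 ≈ 1.0 and p50 ≈ 1.5 match the inputs, but p90 ≈ 2.0, not 2.5:
# a symmetric normal cannot reproduce asymmetric percentiles
```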

Bill Bell
  • Thanks for the input, Bill! This seems to put me in the right direction. But it looks like this method only works for a symmetrical (normal) distribution (b' = 1). How would you modify this method to account for b' != 1? – flash Jan 05 '18 at 08:20
  • @flash: I've put enough work into this already. I think it's your turn. – Bill Bell Jan 05 '18 at 17:11

Here is my attempt at a solution. In the case b' = 1 the data are symmetric, and we should treat this as a normal distribution. pX_out approaches pX_in as the number of samples increases. I would have liked to be able to set upper and lower bounds, but I haven't figured out how to do that yet. Any suggestions would be appreciated. Thank you.

import random
import numpy as np
from scipy.stats import norm

def myerson(p10, p50, p90, number_of_samples):
    # b' measures the asymmetry of the three percentiles
    b_mark = (float(p90) - float(p50)) / (float(p50) - float(p10))
    samples = []
    for i in range(number_of_samples):
        rand_numb = random.random()
        # 0.780304... = 1 / norm.ppf(0.9), so factor is +/-1 at rand_numb = 0.9 / 0.1
        factor = norm.ppf(rand_numb, 0, 0.780304146072379)
        if 0.9999 < b_mark < 1.0001:
            # symmetric case: plain normal
            sample = p50 + (p90 - p50) * factor
        else:
            sample = p50 + (p90 - p50) * ((b_mark ** factor - 1) / (b_mark - 1))
        samples.append(sample)
    return samples

p10_in = 90
p50_in = 100
p90_in = 111
numberofsamples = 10000
data = myerson(p10_in, p50_in, p90_in, numberofsamples)

p10_out = np.percentile(data,10)
p50_out = np.percentile(data,50)
p90_out = np.percentile(data,90)
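One way to see why the b' != 1 branch works: plugging x = 0.1, 0.5, 0.9 into the per-sample computation returns P10, P50, P90 exactly, even for asymmetric inputs like the question's 1 / 1.5 / 2.5 hours. A sketch (`myerson_quantile` is a hypothetical helper that just factors that computation out):

```python
from scipy.stats import norm

def myerson_quantile(x, p10, p50, p90):
    """Hypothetical helper: the per-sample computation from myerson(), factored out."""
    b_mark = (p90 - p50) / (p50 - p10)
    # 1 / norm.ppf(0.9) rescales the normal so that x = 0.9 maps to factor = 1
    factor = norm.ppf(x, 0, 1 / norm.ppf(0.9))
    if 0.9999 < b_mark < 1.0001:
        return p50 + (p90 - p50) * factor
    return p50 + (p90 - p50) * (b_mark ** factor - 1) / (b_mark - 1)

# the question's asymmetric inputs (b' = 2) are recovered exactly
print(myerson_quantile(0.1, 1.0, 1.5, 2.5))  # ≈ 1.0
print(myerson_quantile(0.5, 1.0, 1.5, 2.5))  # ≈ 1.5
print(myerson_quantile(0.9, 1.0, 1.5, 2.5))  # ≈ 2.5
```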
flash

It turns out that the metalog distribution is the right one to use in this case. It's a very flexible distribution that can handle four different scenarios: unbounded, lower-bounded (with a minimum value), upper-bounded (a maximum value), and bounded (both a minimum and a maximum).

import random
from math import log, e
import numpy as np

def metalog_multi(p10, p50, p90, numberofsamples, p0=None, p100=None):
    p10 = float(p10)
    p50 = float(p50)
    p90 = float(p90)
    if p0 is not None:
        p0 = float(p0)
    if p100 is not None:
        p100 = float(p100)

    samples = []
    for i in range(numberofsamples):
        x = random.random()
        if p0 is None and p100 is None:
            # unbound
            sample = p50 + 0.5 * (log((1 - 0.1) / 0.1)) ** (-1) * (p90 - p10) * log(x / (1 - x)) + ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * (1 - 2 * (p50 - p10) / (p90 - p10)) * (p90 - p10) * (x - 0.5) * log(x / (1 - x))

        elif p100 is None:
            # lower bound
            sample = p0 + e ** (log(p50 - p0) + 0.5 * (log((1 - 0.1) / 0.1)) ** -1 * log((p90 - p0) / (p10 - p0)) * log(x / (1 - x)) + ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * log(((p90 - p0) * (p10 - p0)) / (p50 - p0) ** 2) * (x - 0.5) * log(x / (1 - x)))
        elif p0 is None:
            # upper bound
            sample = p100 - e ** (-(-log(p100 - p50) - (0.5) * (log((1 - 0.1) / 0.1)) ** -1 * log((p100 - p90) / (p100 - p10)) * log(x / (1 - x)) - ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * log(((p100 - p90) * (p100 - p10)) / (p100 - p50) ** 2) * (x - 0.5) * log(x / (1 - x))))
        else:
            # bound
            sample = (p0 + p100 * e ** (log((p50 - p0) / (p100 - p50)) + (0.5) * (log((1 - 0.1) / 0.1)) ** -1 * log(((p90 - p0) / (p100 - p90)) / ((p10 - p0) / (p100 - p10))) * log(x / (1 - x)) + ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * log((((p90 - p0) / (p100 - p90)) * ((p10 - p0) / (p100 - p10))) / ((p50 - p0) / (p100 - p50)) ** 2) * (x - 0.5) * log(x / (1 - x)))) / (1 + e ** (log((p50 - p0) / (p100 - p50)) + (0.5) * (log((1 - 0.1) / 0.1)) ** -1 * log(((p90 - p0) / (p100 - p90)) / ((p10 - p0) / (p100 - p10))) * log(x / (1 - x)) + ((1 - 2 * 0.1) *(log((1 - 0.1) / 0.1))) ** -1 * log((((p90 - p0) / (p100 - p90)) * ((p10 - p0) / (p100 - p10))) / ((p50 - p0) / (p100 - p50)) ** 2) * (x - 0.5) * log(x / (1 - x))))
        samples.append(sample)
    return samples


p0_in = 10
p10_in = 20
p50_in = 40
p90_in = 80
p100_in = 250
numberofsamples = 10000
data = metalog_multi(p10_in, p50_in, p90_in, numberofsamples, p0 = p0_in)

p10_out = np.percentile(data,10)
p50_out = np.percentile(data,50)
p90_out = np.percentile(data,90)
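The long lower-bound expression is easier to check with its terms named. A sketch (`metalog_lb_quantile` is a hypothetical restatement of that branch, not a separate method): at x = 0.1, 0.5, 0.9 it reproduces P10, P50, P90 exactly, and p0 acts as a hard lower bound because of the exp().

```python
from math import log, exp

def metalog_lb_quantile(x, p10, p50, p90, p0):
    """Hypothetical helper: the lower-bound branch above, with named terms."""
    k = 1 / log((1 - 0.1) / 0.1)          # 1 / ln(9)
    a1 = log(p50 - p0)
    a2 = 0.5 * k * log((p90 - p0) / (p10 - p0))
    a3 = (k / (1 - 2 * 0.1)) * log((p90 - p0) * (p10 - p0) / (p50 - p0) ** 2)
    y = log(x / (1 - x))                  # logit of the probability
    return p0 + exp(a1 + a2 * y + a3 * (x - 0.5) * y)

# with the example inputs, the percentiles are recovered exactly
print(metalog_lb_quantile(0.5, 20, 40, 80, 10))   # ≈ 40.0
print(metalog_lb_quantile(0.9, 20, 40, 80, 10))   # ≈ 80.0
```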
flash