
I want to use P10, P50 and P90 values as input to: A) generate a probability density function (this feels like a Myerson distribution, but I can't figure out how to do that in Python; there's an Excel add-in, SIPMath, which does exactly that) B) run a simulation (Monte Carlo?) on the PDF

Example: I want to make a simulation of how long it would take to run from A to B.

P10 = 1 hour
P50 = 1.5 hours
P90 = 2.5 hours

Meaning: in 10% of the attempts I run from A to B in 1 hour or less, in 50% of the attempts in 1.5 hours or less (i.e. 1.5 is the median), and in 10% of the attempts I spend more than 2.5 hours.

Thank you

flash

3 Answers


Assuming that it's appropriate to model this system with a Myerson distribution then, according to Frontline Solvers, "[i]f the specified percentiles are equidistant (measured by the parameter b’ below), then the Myerson distribution is equivalent to a Normal distribution." You're in luck with a simple case.

Of course this cannot be quite true because the normal has infinite tails. You would need to draw samples from a normal population that is truncated on the left.

The (untruncated) normal distribution you need has a mean of 1.5 hours and puts 40% of its mass between 1 hour and that mean of 1.5 hours. The standard normal puts 40% of its mass between -1.2815515655446004 and 0. Then, given a supply of standard normal random deviates z, we could convert them to (untruncated) deviates of the kind needed by computing 1.5 + 0.5*z/1.28155, where 0.5 is the 'distance' between 1 hour and 1.5 hours, and 1.28155 is the corresponding 'distance' for the standard normal.
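That shift-and-rescale can be written directly (a minimal sketch, assuming numpy and scipy are available; the seed is arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)    # standard normal deviates
scale = 0.5 / norm.ppf(0.9)      # 0.5 hours corresponds to 1.28155... standard units
samples = 1.5 + scale * z        # shift to the 1.5-hour median, rescale the spread
```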

Being a normal distribution, it's possible that some deviates less than zero will be generated. However, using the scipy library I find that,

>>> norm.cdf(0, loc=1.5, scale=0.5/1.28)
6.151715518325519e-05

I would say that this is so unlikely that it's not worth the bother to treat this as a truncated normal.
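That said, if negative times must be ruled out entirely, scipy's truncnorm can sample from a left-truncated normal directly; a minimal sketch (the seed is arbitrary):

```python
from scipy.stats import truncnorm

loc, scale = 1.5, 0.5 / 1.28155
# truncnorm takes its truncation points in standard-deviation units
a = (0 - loc) / scale        # truncate at zero hours
b = float("inf")             # no upper truncation
sample = truncnorm.rvs(a, b, loc=loc, scale=scale, size=100, random_state=0)
```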

Therefore, to obtain a sample of Myerson deviates as defined in your question you could do this.

>>> from scipy.stats import norm
>>> sample = norm.rvs(loc=1.5, scale=0.5/1.28, size=100)

The values for loc and scale are as we've discussed. The value for size would be whatever sample size you require.
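As a sanity check (a sketch, assuming numpy is available alongside scipy), the empirical percentiles of a large sample can be compared against the inputs. Because the normal is symmetric, it matches P10 and P50 but puts P90 near 2.0 hours rather than the question's 2.5, which is what the b' != 1 follow-up in the comments is about:

```python
import numpy as np
from scipy.stats import norm

sample = norm.rvs(loc=1.5, scale=0.5 / 1.28155, size=100_000, random_state=0)
p10, p50, p90 = np.percentile(sample, [10, 50, 90])
# p10 ≈ 1.0 and p50 ≈ 1.5 match the inputs, but p90 ≈ 2.0, not 2.5:
# a symmetric normal cannot reproduce asymmetric percentiles
```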

Bill Bell
  • Thanks for the input, Bill! This seems to put me in the right direction. But it looks like this method only works for a symmetrical (normal) distribution (b' = 1). How would you modify this method to account for b' != 1? – flash Jan 05 '18 at 08:20
  • @flash: I've put enough work into this already. I think it's your turn. – Bill Bell Jan 05 '18 at 17:11

Here is my attempt at a solution. In the case b' = 1 the data are symmetric, and we should treat this as a normal distribution. pX_out approaches pX_in as the number of samples increases. I would have liked to be able to set upper and lower bounds, but I haven't figured out how to do that yet. Any suggestions would be appreciated. Thank you.

import random
import numpy as np
from scipy.stats import norm

def myerson(p10, p50, p90, number_of_samples):
    # b' measures the asymmetry of the three percentiles
    b_mark = (float(p90) - float(p50)) / (float(p50) - float(p10))
    samples = []
    for i in range(number_of_samples):
        rand_numb = random.random()
        # 0.780304... = 1 / norm.ppf(0.9), so factor is +/-1 at rand_numb = 0.9 / 0.1
        factor = norm.ppf(rand_numb, 0, 0.780304146072379)
        if 0.9999 < b_mark < 1.0001:
            # symmetric case: plain normal
            sample = p50 + (p90 - p50) * factor
        else:
            sample = p50 + (p90 - p50) * ((b_mark ** factor - 1) / (b_mark - 1))
        samples.append(sample)
    return samples

p10_in = 90
p50_in = 100
p90_in = 111
numberofsamples = 10000
data = myerson(p10_in, p50_in, p90_in, numberofsamples)

p10_out = np.percentile(data,10)
p50_out = np.percentile(data,50)
p90_out = np.percentile(data,90)
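One way to see why the b' != 1 branch works: plugging x = 0.1, 0.5, 0.9 into the per-sample computation returns P10, P50, P90 exactly, even for asymmetric inputs like the question's 1 / 1.5 / 2.5 hours. A sketch (`myerson_quantile` is a hypothetical helper that just factors that computation out):

```python
from scipy.stats import norm

def myerson_quantile(x, p10, p50, p90):
    """Hypothetical helper: the per-sample computation from myerson(), factored out."""
    b_mark = (p90 - p50) / (p50 - p10)
    # 1 / norm.ppf(0.9) rescales the normal so that x = 0.9 maps to factor = 1
    factor = norm.ppf(x, 0, 1 / norm.ppf(0.9))
    if 0.9999 < b_mark < 1.0001:
        return p50 + (p90 - p50) * factor
    return p50 + (p90 - p50) * (b_mark ** factor - 1) / (b_mark - 1)

# the question's asymmetric inputs (b' = 2) are recovered exactly
print(myerson_quantile(0.1, 1.0, 1.5, 2.5))  # ≈ 1.0
print(myerson_quantile(0.5, 1.0, 1.5, 2.5))  # ≈ 1.5
print(myerson_quantile(0.9, 1.0, 1.5, 2.5))  # ≈ 2.5
```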
flash

It turns out that the metalog distribution is the right one to use in this case. It's a very flexible distribution that can handle four different scenarios: unbounded, lower-bounded (with a minimum value), upper-bounded (a maximum value), and bounded (both a minimum and a maximum).

import random
from math import log, e
import numpy as np

def metalog_multi(p10, p50, p90, numberofsamples, p0=None, p100=None):
    p10 = float(p10)
    p50 = float(p50)
    p90 = float(p90)
    if p0 is not None:
        p0 = float(p0)
    if p100 is not None:
        p100 = float(p100)

    samples = []
    for i in range(numberofsamples):
        x = random.random()
        if p0 is None and p100 is None:
            # unbound
            sample = p50 + 0.5 * (log((1 - 0.1) / 0.1)) ** (-1) * (p90 - p10) * log(x / (1 - x)) + ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * (1 - 2 * (p50 - p10) / (p90 - p10)) * (p90 - p10) * (x - 0.5) * log(x / (1 - x))

        elif p100 is None:
            # lower bound
            sample = p0 + e ** (log(p50 - p0) + 0.5 * (log((1 - 0.1) / 0.1)) ** -1 * log((p90 - p0) / (p10 - p0)) * log(x / (1 - x)) + ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * log(((p90 - p0) * (p10 - p0)) / (p50 - p0) ** 2) * (x - 0.5) * log(x / (1 - x)))
        elif p0 is None:
            # upper bound
            sample = p100 - e ** (-(-log(p100 - p50) - (0.5) * (log((1 - 0.1) / 0.1)) ** -1 * log((p100 - p90) / (p100 - p10)) * log(x / (1 - x)) - ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * log(((p100 - p90) * (p100 - p10)) / (p100 - p50) ** 2) * (x - 0.5) * log(x / (1 - x))))
        else:
            # bound
            sample = (p0 + p100 * e ** (log((p50 - p0) / (p100 - p50)) + (0.5) * (log((1 - 0.1) / 0.1)) ** -1 * log(((p90 - p0) / (p100 - p90)) / ((p10 - p0) / (p100 - p10))) * log(x / (1 - x)) + ((1 - 2 * 0.1) * (log((1 - 0.1) / 0.1))) ** -1 * log((((p90 - p0) / (p100 - p90)) * ((p10 - p0) / (p100 - p10))) / ((p50 - p0) / (p100 - p50)) ** 2) * (x - 0.5) * log(x / (1 - x)))) / (1 + e ** (log((p50 - p0) / (p100 - p50)) + (0.5) * (log((1 - 0.1) / 0.1)) ** -1 * log(((p90 - p0) / (p100 - p90)) / ((p10 - p0) / (p100 - p10))) * log(x / (1 - x)) + ((1 - 2 * 0.1) *(log((1 - 0.1) / 0.1))) ** -1 * log((((p90 - p0) / (p100 - p90)) * ((p10 - p0) / (p100 - p10))) / ((p50 - p0) / (p100 - p50)) ** 2) * (x - 0.5) * log(x / (1 - x))))
        samples.append(sample)
    return samples


p0_in = 10
p10_in = 20
p50_in = 40
p90_in = 80
p100_in = 250
numberofsamples = 10000
data = metalog_multi(p10_in, p50_in, p90_in, numberofsamples, p0 = p0_in)

p10_out = np.percentile(data,10)
p50_out = np.percentile(data,50)
p90_out = np.percentile(data,90)
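The long lower-bound expression is easier to check with its terms named. A sketch (`metalog_lb_quantile` is a hypothetical restatement of that branch, not a separate method): at x = 0.1, 0.5, 0.9 it reproduces P10, P50, P90 exactly, and p0 acts as a hard lower bound because of the exp().

```python
from math import log, exp

def metalog_lb_quantile(x, p10, p50, p90, p0):
    """Hypothetical helper: the lower-bound branch above, with named terms."""
    k = 1 / log((1 - 0.1) / 0.1)          # 1 / ln(9)
    a1 = log(p50 - p0)
    a2 = 0.5 * k * log((p90 - p0) / (p10 - p0))
    a3 = (k / (1 - 2 * 0.1)) * log((p90 - p0) * (p10 - p0) / (p50 - p0) ** 2)
    y = log(x / (1 - x))                  # logit of the probability
    return p0 + exp(a1 + a2 * y + a3 * (x - 0.5) * y)

# with the example inputs, the percentiles are recovered exactly
print(metalog_lb_quantile(0.5, 20, 40, 80, 10))   # ≈ 40.0
print(metalog_lb_quantile(0.9, 20, 40, 80, 10))   # ≈ 80.0
```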
flash