3

I would like to generate a list of n strictly positive values such that the list has a predetermined mean and standard deviation (close is fine; they need not be exact). I was using the uniform distribution's equations for expectation and variance and solving for 'a' and 'b', but the system of equations (for the specific mean and std. dev. I wanted) had no solutions for a, b >= 0. I was wondering if there was a plug-and-chug method to do this in any programming language, ideally in Python. Thanks!

Ex: generate list of 84 positive values with mean ~= 60/84 = 0.71, std.dev. ~= 1.7

4pie0
  • 29,204
  • 9
  • 82
  • 118
sgchako
  • 109
  • 1
  • 8
  • In response to a deleted comment suggesting I use a normal distribution with specified mean/std dev: Yes, and I have tried that, but using the example mean and std dev, a normal dist is bound to generate some negative values. My goal is to have no negative values in my list. – sgchako Apr 18 '14 at 18:00
  • If maintaining the mean and standard deviation is important, I suspect that you are making some assumption about the distribution of those values. Aside from being strictly positive, do you have other requirements, or can the probability function be totally arbitrary? – Taro Sato Apr 18 '14 at 18:14
  • @TaroSato the fn can be totally arbitrary. The only thing I am concerned with are the mentioned constraints. – sgchako Apr 18 '14 at 18:18
  • Also, do you need to ensure that the set of values has its mean and standard deviation "very close" to the predefined ones? I ask because if you draw a set of values from some probability distribution, with a small sample the mean and standard deviation computed from that sample can be quite a bit different from those of the parent population. In practice, we don't do this kind of thing (we only make an assumption about the probability distribution), but it is not clear from the question what the goal really is. – Taro Sato Apr 18 '14 at 18:18
  • Yes, that is another problem I ran into with an earlier approach I tried. The sample mean and variance being very close is okay, but far is not (even if the distribution does have the assumed mean and std dev). What is really important is the sample mean/sample variance. I really could not care less what the distribution looks like. Hopefully I've made the problem clear at this point. Let me know if I can further clarify. – sgchako Apr 18 '14 at 18:20

4 Answers

1

Assume a (continuous) uniform distribution with minimum a and maximum b. Such a distribution has the following mean and variance:

mean = (a + b) / 2

var = (b - a)^2 / 12

where the standard deviation is simply sqrt(var). Given the mean and variance (and therefore standard deviation), the set of equations can be solved for a and b:

a = mean - sqrt(3 * var)

b = mean + sqrt(3 * var)

To create a list with this mean and variance, you simply want to generate n equally spaced points within [a, b]. A Python code snippet follows:

#!/usr/bin/env python2.7
from math import sqrt


def uniform(mean, std, n):
    # Bounds of the uniform distribution with the requested mean and std.
    a = mean - sqrt(3.) * std
    b = mean + sqrt(3.) * std
    # n equally spaced points spanning [a, b].
    xs = [(b - a) * (i / (n - 1.)) + a for i in range(n)]
    return xs


for target_mean, target_std, n in [(10, 1, 100),
                                   (0.71, 1.7, 84)]:
    xs = uniform(target_mean, target_std, n)
    print xs

    mean = 1. * sum(xs) / n
    var = sum([(x - mean)**2 / n for x in xs])

    print 'mean: {} ({})'.format(mean, target_mean)
    print 'std: {} ({})'.format(sqrt(var), target_std)

    if not min(xs) > 0:
        print 'WARNING: but this is not strictly positive'

    print

Note that certain combinations of mean and variance yield negative values, so you need to conditionally exclude them. Alternatively, you can choose some other probability distribution function that only draws strictly positive numbers. How easy it is to relate the mean and variance to the parameters that characterize the distribution depends on the distribution; I arbitrarily picked the uniform because it is simple.
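
As a quick sanity check on this construction (the helper below and its name are mine, not part of the answer's code): the lower bound a = mean - sqrt(3) * std is positive only when mean / std > sqrt(3) ≈ 1.73, which the example from the question (mean 0.71, std 1.7) does not satisfy.

from math import sqrt


def uniform_stays_positive(mean, std):
    # The matching uniform distribution has lower bound a = mean - sqrt(3) * std,
    # so the equally spaced values are strictly positive only when a > 0.
    return mean - sqrt(3.) * std > 0


print(uniform_stays_positive(10, 1))      # True:  10 / 1 > sqrt(3)
print(uniform_stays_positive(0.71, 1.7))  # False: 0.71 / 1.7 < sqrt(3)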

However, I find the premise of the original question a bit contrived, so depending on the problem, doing this sort of thing might not actually be desirable.

Taro Sato
  • 1,444
  • 1
  • 15
  • 19
  • Thanks Taro Sato for your work! Your comment about certain combinations of mean and variance yielding negative values is something I ran into earlier, and described in my original post. I guess there is no way to do this for certain predetermined means/variances. – sgchako Apr 18 '14 at 19:27
  • With any distribution you assume, there are certain restrictions on the way the mean and standard deviation can be chosen, so you cannot choose everything freely and arbitrarily and hope the values drawn are strictly positive. With my example, you just need to choose the mean and standard deviation so that the minimum a is positive. You could do something similar with a different distribution, but then some other restrictions are imposed (like with Poisson, where the mean and variance are related one-to-one). You simply cannot pick everything arbitrarily; otherwise you need something brute force. – Taro Sato Apr 18 '14 at 19:35
  • Are there any ways to implement a brute force method (like plug-and-chug) in Python? Something to solve a system like sum a_i = x, sum (a_i)^2 = y, with a_i >= 0 and x, y predetermined? If that were possible with some plug-and-chug solution, I could generate a suitable list that way. I've just never implemented something like that before and don't know how to tell the computer to try numbers until it gets something that works. – sgchako Apr 18 '14 at 19:42
  • Not one that I know of that's efficient enough or not contrived. I just don't find the premise of the question sensible (which is why I think this is not a proper approach to the problem you are really trying to solve). The mean and standard deviation are a way of characterizing a probability distribution, so if you are saying the distribution really does not matter, then I have to question why you care so much about the mean and standard deviation to begin with (because without a distribution, they are not very meaningful for interpretation). – Taro Sato Apr 18 '14 at 19:51
  • The problem I am tackling is basically data justification (not malicious, I promise :) ). For instance, say you were observing a queue at a pizza parlor and measuring arrival times of customers. You would get a list of n entry times (each of course being positive). The list would have a mean and a standard deviation. The distribution of arrival times at this pizza shop is unknown, of course (so it does not really matter). What I am trying to do is, given some predetermined mean and standard deviation, generate a list of arrival times that will result in said mean & std. dev. – sgchako Apr 18 '14 at 20:13
  • This is not correct, and yes, you can draw from a truncated normal distribution – 4pie0 Apr 18 '14 at 20:16
  • I didn't know exactly what context made the original question necessary, so I cannot provide the most optimal solution. It is of course possible to design some arbitrary probability distribution that satisfies the premise of the question, but if it can really be arbitrary, then there is probably no optimal solution either. But if you are dealing with arrival times of customers, then you could look into something like the Poisson process and start from there. – Taro Sato Apr 18 '14 at 20:35
  • How could such a probability dist be designed? The issue with Poisson is that it only takes one parameter, and that parameter defines both its mean and variance. It seems this problem could best be tackled as a constrained optimization problem, but I have no experience designing an implementation of one, which is why I phrased the original post the way I did (plug-and-chug method). – sgchako Apr 18 '14 at 22:00
  • As any arbitrary distribution would do, you could use the truncated normal distribution mentioned (see Wikipedia for details). I realized it is not so complicated to find the corresponding PDF for a = 0 and b -> infinity for that model. Once the PDF is found, the question becomes how to scale that distribution and bin the resulting histogram so that the total count equals the desired length for the list. You construct the list by repeating each bin's value by its count. I am not aware of any plug-and-chug solution for the latter process, but that might just be me. – Taro Sato Apr 19 '14 at 00:26
  • The conjugate prior to the Poisson distribution will solve the problem nicely, see [my answer](http://stackoverflow.com/a/23166439/1756702). – A. Webb Apr 19 '14 at 05:52
  • Nice answer, but I had an impression that the question asker was also concerned with the sample mean and std. dev. deviating from the parameters defining the parent distribution, when the sample size is small. I thought the issue could be mitigated by integrating over the pdf, normalizing to the list size desired, rather than randomly sampling. I felt that was something quite contrived. Anyways... – Taro Sato Apr 19 '14 at 06:31
  • BTW, the sample mean will equal the desired mean, and the sample std. dev. will be sqrt(1+2/(n-1)) times the desired std. dev. And it will print the warning when mean/std.dev. < sqrt(3), of course. – Teepeemm Apr 22 '14 at 16:56
1

Answer

Use NumPy to generate samples from a gamma distribution with scale parameter theta = variance / mean and shape parameter k = mean / theta.

Example

>>> import numpy

>>> mu = 0.71
>>> var = 1.7**2
>>> theta = var / mu
>>> k = mu / theta

>>> samples = numpy.random.gamma(k, theta, 1000)

>>> numpy.mean(samples)
0.71622189354608201

>>> numpy.std(samples)
1.7865898752966483

Commentary

The constraints you provide underspecify the distribution. Some of the comments you made in response to another answer would have been helpful as part of the question. In particular, it seems as if you might be trying to model arrivals in a queue, e.g. a Poisson process. As you pointed out, the mean and variance of a Poisson distribution are the same: the lambda parameter. However, consider lambda itself as a random variable. The conjugate prior to the Poisson distribution is the gamma distribution.

With shape parameter k > 0 and scale parameter theta > 0, the gamma distribution has mean = k * theta and variance = k * theta^2. Therefore, theta is variance / mean > 0 and k is mean / theta > 0. Since the gamma distribution has positive support, this conveniently answers your question.
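
If you also want the sample statistics themselves to land close to the targets (the asker mentions resampling within a 5% margin in the comments below), you can wrap the draw in a retry loop. A minimal sketch, assuming NumPy is available; the function name, tolerance, and retry cap are mine:

import numpy


def positive_samples(mu, sigma, n, tol=0.05, max_tries=1000):
    # Gamma parameterization from above: scale theta = var / mean, shape k = mean / theta.
    theta = sigma**2 / mu
    k = mu / theta
    for _ in range(max_tries):
        samples = numpy.random.gamma(k, theta, n)
        # Accept only when the sample mean and std land within the relative tolerance.
        if (abs(numpy.mean(samples) - mu) <= tol * mu and
                abs(numpy.std(samples) - sigma) <= tol * sigma):
            return samples
    raise RuntimeError('no draw met the tolerance; loosen tol or increase max_tries')


xs = positive_samples(0.71, 1.7, 84)
print('mean: {}, std: {}'.format(numpy.mean(xs), numpy.std(xs)))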

Community
  • 1
  • 1
A. Webb
  • 26,227
  • 1
  • 63
  • 95
  • This worked very well. I added an if else block to resample if the resulting list did not have mean/std.dev within a 5% margin of error of my desired mean/std.dev. Thanks! – sgchako Apr 21 '14 at 03:51
0

The requirement to construct a distribution with a given mean and deviation cannot be satisfied if the deviation is greater than the distance from the mean to the farther bound. To see this, first notice that in a sample

x1, x2, ..., xn

with mean mi = sum(x_i)/n

the deviation is bounded:

dev < xmax - mean and dev < mean - xmin. Without deriving a formula, this is quite intuitive: the standard deviation is, loosely speaking, the average deviation from the mean, so how could it be greater than the maximum deviation from the mean, max of (mean - xmin, xmax - mean)?

So if the deviation is greater than max of [mean - xmin, xmax - mean], then there is no solution. Now let's take a look at the two other cases:

  • when it is in the range (0, min of [mean - xmin, xmax - mean])

  • and when it is in the range (0, max of [mean - xmin, xmax - mean]) but not in the range (0, min of [mean - xmin, xmax - mean]) (so it is greater than one of the two distances but less than the other)


When it is in the range (0, min of [mean - xmin, xmax - mean])

Bernoulli distribution

It is simple to construct a distribution that yields a sample with mean mi and deviation d, with all values in the range [xmin, xmax]. The simple case of a two-point distribution with

x1 = mi - d, x2 = mi + d

(each point drawn with probability 1/2) has expectation mi and deviation d.

#include <algorithm>  // std::min
#include <stdexcept>  // std::out_of_range
#include <vector>
#include <boost/random.hpp>
#include <boost/random/bernoulli_distribution.hpp>

double generate_from_bernoulli_distribution(double mi, double d,
                                            double a, double b) {
    if (b <= a || d < 0) throw std::out_of_range("invalid parameters");
    if (d > std::min(mi - a, b - mi))
        throw std::out_of_range("invalid standard deviation");
    double x1 = mi - d, x2 = mi + d;
    boost::mt19937 rng; // I don't seed it on purpose (it's not relevant)
    boost::bernoulli_distribution<> bd;
    boost::variate_generator<boost::mt19937&,
            boost::bernoulli_distribution<> > var_ber(rng, bd);
    double bernoulli = var_ber();
    return x1 + bernoulli * 2 * d; // return x1 on 0, or x2 on 1
}

void generate_n_from_bernoulli_distribution(double mi, double d, double a,
                                   double b, std::vector<double>& res, int n) {
    if (b <= a || d < 0) throw std::out_of_range("invalid parameters");
    if (d > std::min(mi - a, b - mi))
        throw std::out_of_range("invalid standard deviation");
    double x1 = mi - d, x2 = mi + d;

    boost::mt19937 rng; // I don't seed it on purpose (it's not relevant)
    boost::bernoulli_distribution<> bd;
    boost::variate_generator<boost::mt19937&,
            boost::bernoulli_distribution<> > var_ber(rng, bd);

    for (int i = 0; i < n; ++i) {
        double bernoulli = var_ber();
        res.push_back(x1 + bernoulli * 2 * d); // push_back x1 on 0, or x2 on 1
    }
}

usage:

int main()
{
  double rc = generate_from_bernoulli_distribution( 4, 1, 0, 10);
  std::vector<double> sample;
  generate_n_from_bernoulli_distribution( 4, 1, 0, 10, sample, 100);
  return 0;
}

The Bernoulli, two-point distribution is the first case to consider, as it has the weakest requirements. Sometimes it will also be possible to draw from other distributions, for example from a uniform distribution.


Uniform distribution

The first two moments of the uniform distribution (the mean and variance) in terms of its range [a, b] are given by

mean = mi = (a + b) / 2

var = d^2 = (b - a)^2 / 12

where

a = mi - alpha, b = mi + alpha, and alpha is any positive real number.

So there is an infinite number of uniform distributions that yield mean mi; all of them are just centered over mi. The additional requirement on the variance gives a single solution for a, b:

a = mi - d * sqrt(3)

b = mi + d * sqrt(3)

#include <cmath>  // std::sqrt

/**
 * Computes the interval [a, b] of the uniform distribution
 * with a given mean and deviation.
 * @param mi    mean
 * @param d     deviation
 * @param a     left bound (output)
 * @param b     right bound (output)
 */
void uniform_distribution_intervals( double mi, double d, double& a, double& b) {
    a = mi - d * std::sqrt(3.0);
    b = mi + d * std::sqrt(3.0);
}

Clearly, it is not always possible to find a uniform distribution for a given mi, d whose left bound is greater than 0. In this case

uniform_distribution_intervals( 60.0/84, 1.7, a, b);

unfortunately returns a ≈ -2.23, b ≈ 3.66, so the left bound is negative.


When it is in the range (0, max of [mean - xmin, xmax - mean]) but not in the range (0, min of [mean - xmin, xmax - mean])

Left as a useful exercise.

4pie0
  • 29,204
  • 9
  • 82
  • 118
  • I'm having trouble following what you're saying. I think you're trying to say that you'll run into trouble if the standard deviation is significantly larger than the mean. By "deviation", do you mean the maximum distance between any sample and the mean? – Teepeemm Apr 22 '14 at 16:38
  • No, I am saying the standard deviation can't be greater than max of (mean - xmin, xmax - mean). By standard deviation I mean http://en.wikipedia.org/wiki/Standard_deviation – 4pie0 Apr 22 '14 at 16:58
0

Saying a "distribution is unknown" is different than "does not really matter" (both statements are in the same comment to Taro Sato’s answer). One way to get a desired mean and standard deviation is to set M=mean+var^2/mean and have some samples barely positive and the other samples be M. By making the samples correctly, you’ll get the mean and standard deviation. In the case you listed: M=4.78, 12 samples of M, and 68 samples of .001 would give mean=.718 and std.dev.=1.71. But arrival times are not accurately modeled as some 0 and some M.

Teepeemm
  • 4,331
  • 5
  • 35
  • 58