I have a dataset of an arbitrary number of elements (say, 10000) which follows a lognormal distribution over a certain range of values (say, between 1 and 500). I successfully fit a distribution with the powerlaw module in Python. I then need to generate values from this distribution that stay within the given bounds (between 1 and 500, with reasonable tolerance) and match the size of the input dataset. I have tried the generators included in the powerlaw module itself, and while they work, the generated values far exceed the maximum I can accept: my maximum is around 500, yet the synthetic dataset routinely hits 6600.
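
For reference, the fitting step looks roughly like this (a minimal sketch assuming the standard powerlaw API, where data stands in for the input dataset):

import powerlaw

fit = powerlaw.Fit(data)  # fit candidate distributions, including lognormal
mu, sigma = fit.lognormal.mu, fit.lognormal.sigma  # fitted lognormal parameters
synthetic = fit.lognormal.generate_random(10000)  # unbounded draws; these can overshoot 500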

I have attempted to generate lognormal values by exponentiating draws from a truncated normal distribution:

import numpy as np
from scipy.stats import truncnorm

def generate_lognormal(xmin, xmax, mu, sigma, n):
    # The idea: draw from a truncated standard normal, then exponentiate.
    min_bound = (np.log(xmin) - mu) / sigma  # lower truncation point in standard-normal space
    max_bound = (np.log(xmax) - mu) / sigma  # upper truncation point in standard-normal space
    rand = truncnorm.rvs(min_bound, max_bound, loc=0, scale=1, size=n)  # truncated normal draws
    return np.exp(mu + sigma * rand)  # map back to the lognormal scale
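
Called, for example, like this (the mu and sigma values below are hypothetical placeholders for the fitted parameters):

samples = generate_lognormal(1, 500, mu=3.0, sigma=1.2, n=10000)
print(samples.min(), samples.max())  # stays within [1, 500] by construction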

But here I faced a different problem: most of the time the generated data only reach around half of the desired range, with the largest values ending up at 200-300 and no cases close to 500. In fact, both scenarios (overshoot and undershoot) can happen with this code. Is there a way to generate values from lognormal (and power law) distributions within bounds that is stable between iterations?

Julian_P

1 Answer


You are looking for rejection sampling. Simply draw values and discard any that are out of range.
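
For example, a minimal sketch assuming numpy and the fitted mu and sigma from the question (the function name is hypothetical):

import numpy as np

def sample_bounded_lognormal(xmin, xmax, mu, sigma, n, rng=None):
    # Rejection sampling: draw from the unbounded lognormal and keep only
    # the draws that land inside [xmin, xmax], until n samples are collected.
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty(0)
    while out.size < n:
        draw = rng.lognormal(mean=mu, sigma=sigma, size=n)
        out = np.concatenate([out, draw[(draw >= xmin) & (draw <= xmax)]])
    return out[:n]

The values kept this way follow the lognormal density restricted to [xmin, xmax], i.e. the same truncated distribution your exp-of-truncated-normal approach targets.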

That said, I'm concerned that the fitted distribution does not model the experimental dataset well. Eyeball some plots, or use a two-sample Kolmogorov-Smirnov test to compare the two samples (a t-test only compares means, not whole distributions).
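
A quick check along those lines (original_data and synthetic_data are placeholders for the measured and generated samples):

from scipy.stats import ks_2samp

stat, pvalue = ks_2samp(original_data, synthetic_data)
print(f"KS statistic: {stat:.3f}, p-value: {pvalue:.3f}")  # a small p-value suggests a poor fit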

J_H