I'd be very grateful if someone could help me understand where I'm going wrong. I have some data describing probability distributions. The data provides me with values for P10, P50 and P90. I also know that the distribution is lognormal.
I've read that if a random variable X is log-normally distributed, then Y = ln(X) has a normal distribution - see e.g. Wikipedia (https://en.wikipedia.org/wiki/Log-normal_distribution).
However, when I try to verify this using scipy.stats and numpy, I cannot get it to hold. Since I know it is true, and I know there are no issues with the simple functions I'm using in these Python libraries, there must be a gap in my understanding somewhere. I just, for the life of me, cannot see what I'm missing...
The code I'm using is:
import numpy as np
from scipy import stats as ss

# build a lognormal distribution with scipy.stats (ss):
# set parameters (based on the standard normal distribution, mu=0 and sigma=1):
s, mu, sd, size = 0.5, 0, 1, 100000
# save the distribution:
X = ss.lognorm.rvs(s, loc=mu, scale=sd, size=size)
# convert to normal distribution (i.e. calc the natural log of X):
Y = np.log(X)
# check if Y is normal using the ratio between p90-p50 and p50-p10 - should be 1:
p10, p50, p90 = np.percentile(Y, [10, 50, 90])
(p90 - p50) / (p50 - p10)
The above returns 0.9932 - or something else pretty close to 1. So far so good. I can vary s and scale as much as I like (or as much as I have tried so far) and the ratio always comes out close to 1. The problem comes when I vary the mean (mu, passed as loc):
# build a lognormal distribution with scipy.stats (ss):
# set parameters (normal distribution, mu=100 and sigma=10):
s, mu, sd, size = 0.5, 100, 10, 100000
# save the distribution:
X = ss.lognorm.rvs(s, loc=mu, scale=sd, size=size)
# convert to normal distribution (i.e. calc the natural log of X):
Y = np.log(X)
# check if Y is normal using the ratio between p90-p50 and p50-p10 - should be 1:
p10, p50, p90 = np.percentile(Y, [10, 50, 90])
(p90 - p50) / (p50 - p10)
In this instance the answer I get is around 1.8 - i.e. Y is not normally distributed. As I say, I'm clearly misunderstanding something, but I can't see what it is.
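For comparison, here is a minimal sketch of the same check under one assumption taken from the scipy.stats.lognorm documentation: loc is a plain shift of X and scale multiplies the unshifted variable, so ln(X - loc), rather than ln(X), would be the normally distributed quantity, with mean ln(scale) and standard deviation s. The helper name tail_ratio and the fixed seed are illustrative and not part of the code above.

```python
import numpy as np
from scipy import stats as ss


def tail_ratio(sample):
    """Return (p90 - p50) / (p50 - p10); close to 1 for a symmetric sample."""
    p10, p50, p90 = np.percentile(sample, [10, 50, 90])
    return (p90 - p50) / (p50 - p10)


# Same parameters as the second snippet; the seed is an arbitrary choice.
s, loc, scale, size = 0.5, 100, 10, 100_000
X = ss.lognorm.rvs(s, loc=loc, scale=scale, size=size, random_state=0)

# Log of the shifted variable: the asymmetric case from the question.
ratio_raw = tail_ratio(np.log(X))

# Assumption: removing the shift first recovers a normal variable.
Y = np.log(X - loc)
ratio_shifted = tail_ratio(Y)

print("ratio for ln(X):      ", ratio_raw)
print("ratio for ln(X - loc):", ratio_shifted)
print("mean of ln(X - loc):", Y.mean(), "vs ln(scale):", np.log(scale))
print("std of ln(X - loc): ", Y.std(), "vs s:", s)
```

If that reading of the parameters is right, the mean of the underlying normal enters through scale = exp(mu) and its standard deviation through s, while loc stays at 0 for an ordinary two-parameter lognormal.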
In summary, if I use ss.lognorm.rvs to generate a series of log-normally distributed random variables with a loc of anything other than 0, and then use np.log to take the natural log of those random variables, the resulting distribution is not normal. On the surface, this appears to violate the rule described at the top of the Wikipedia article linked at the top of this question!
I'm very grateful for any help anyone can give me - I just want to be confident that I understand how to relate the lognormal data to a normal curve!
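As a sketch of how P10, P50 and P90 might be related back to a normal curve, the snippet below assumes a two-parameter lognormal (loc = 0), so that ln(X) is normal with mu = ln(P50) and sigma = (ln(P90) - ln(P10)) / (2 * z90), where z90 is the 90th percentile of the standard normal. The three percentile values are made-up illustrative numbers, not real data.

```python
import numpy as np
from scipy import stats as ss

# Hypothetical percentile values, for illustration only.
p10, p50, p90 = 52.7, 100.0, 189.8

# 90th percentile of the standard normal (about 1.2816).
z90 = ss.norm.ppf(0.9)

# Parameters of the normal distribution of ln(X), assuming loc = 0.
mu = np.log(p50)
sigma = (np.log(p90) - np.log(p10)) / (2 * z90)

# scipy's lognorm takes s = sigma and scale = exp(mu), with loc left at 0.
dist = ss.lognorm(s=sigma, scale=np.exp(mu))
recovered = dist.ppf([0.1, 0.5, 0.9])

print("mu =", mu, "sigma =", sigma)
print("recovered P10, P50, P90:", recovered)
```

Using both tails to estimate sigma is a design choice; with real data, ln(P90) - ln(P50) and ln(P50) - ln(P10) will generally not be exactly equal, and comparing them gives a quick check of how well the lognormal assumption holds.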