I have a very large dataset that I need to analyze statistically. The data is too large to read in all at once, so I only have the binned histogram to work from. In particular, I would like to fit the cumulative counts (i.e., the number of counts to the right of each point x in the histogram).

Here's a script I have which makes some mock data:

import numpy as np

mu, sigma = 0.3, 1.3
x1 = np.random.lognormal(mu, sigma, size=100000)  # mock lognormal sample
bins = 10**np.arange(0, 4, 0.01)  # actual bins my real data uses

a, b = np.histogram(x1, bins=bins)  # a: counts per bin, b: bin edges

# calculating the cumulatives: counts in bin i plus every bin to its right
cum = a[::-1].cumsum()[::-1]

So the cumulative I want to fit looks like the following:

import matplotlib.pyplot as plt

plt.clf()
plt.loglog(b[:-1], cum)
plt.xlabel("Amps")
plt.ylabel("# Occurrences/Year")
plt.show()

[Figure: plot of the cumulative which I need to fit]

My questions are as follows:

1) How do I fit a lognormal to the cumulative? As far as I can see, scipy.stats.lognorm.fit takes the original dataset as its argument, and I can't load that all at once.

2) I see from this Stack Overflow question that you can 'restore' the data from the histogram. I'd prefer to work from the cumulative, though. Is that the right approach?
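For reference, one way question 1 could be approached is to fit N * P(X > x) directly to the cumulative counts with scipy.optimize.curve_fit, using scipy.stats.lognorm.logsf as the model. This is only a minimal sketch under assumptions that are not part of the setup above: the total count N is treated as a free parameter (np.histogram drops samples below the first bin edge), the fit is done in log space, and each point gets an approximate Poisson weight of 1/sqrt(cum). The names log_model, log_n and sigma_ln are made up for illustration.

```python
import numpy as np
from scipy import optimize, stats

# Mock data, same recipe as the script above but with a seeded generator.
rng = np.random.default_rng(0)
samples = rng.lognormal(0.3, 1.3, size=100_000)
bins = 10 ** np.arange(0, 4, 0.01)
counts, edges = np.histogram(samples, bins=bins)

# Counts in bin i plus every bin to its right.
cum = counts[::-1].cumsum()[::-1]

def log_model(x, log_n, mu, sigma_ln):
    """log(N * P(X > x)) for a lognormal with log-space mean mu and std sigma_ln."""
    return log_n + stats.lognorm.logsf(x, s=sigma_ln, scale=np.exp(mu))

# Only bins with a non-zero cumulative can be used in log space.
mask = cum > 0
x_fit = edges[:-1][mask]
y_fit = np.log(cum[mask])

# Assumption: Poisson-like scatter, so the error on log(cum) is about 1/sqrt(cum).
weights = 1.0 / np.sqrt(cum[mask])

popt, pcov = optimize.curve_fit(
    log_model,
    x_fit,
    y_fit,
    p0=[np.log(cum[0]), 0.0, 1.0],
    sigma=weights,
    bounds=([-np.inf, -np.inf, 1e-6], [np.inf, np.inf, np.inf]),  # keep sigma_ln > 0
)
log_n_fit, mu_fit, sigma_fit = popt
print("N ~", np.exp(log_n_fit), "mu ~", mu_fit, "sigma ~", sigma_fit)
```

Neighbouring points of a cumulative are strongly correlated, so the covariance matrix pcov returned here probably understates the real parameter uncertainty and should not be read as a proper error estimate.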

As you can probably guess, I'm not used to working with these distributions.

Thanks!

  • Also, I see that this [question](https://stackoverflow.com/questions/42163438/fitting-binned-lognormal-data-in-python) tries to do something similar, except it fits the histogram. I'm keen to fit the cumulative, as the bin sizes will change the fit to the histogram. – phys_geo_person Aug 31 '17 at 10:14
  • Not a direct answer to your question, but you could also estimate the parameters of the lognormal distribution from the mean and variance of your data, see [here](https://stats.stackexchange.com/questions/26608/how-do-i-estimate-the-parameters-of-a-log-normal-distribution-from-the-sample-me) or [here](https://stats.stackexchange.com/questions/174449/can-i-get-the-parameters-of-a-lognormal-distribution-from-the-sample-mean-medi). – user8153 Aug 31 '17 at 22:51
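The moment-based suggestion in the last comment could be sketched roughly as follows. It uses the standard lognormal relations sigma^2 = ln(1 + var/mean^2) and mu = ln(mean) - sigma^2/2. The chunked loop is an assumption about how the mean and variance might be accumulated when the data cannot be loaded at once, and lognormal_from_moments is a made-up helper name, not a SciPy function.

```python
import numpy as np

def lognormal_from_moments(mean, var):
    """Return (mu, sigma) of the lognormal that has the given mean and variance."""
    sigma2 = np.log1p(var / mean**2)
    mu = np.log(mean) - 0.5 * sigma2
    return mu, np.sqrt(sigma2)

# Hypothetical chunked pass: only running sums are kept in memory.
rng = np.random.default_rng(1)
n, total, total_sq = 0, 0.0, 0.0
for _ in range(10):
    chunk = rng.lognormal(0.3, 1.3, size=10_000)  # stand-in for one chunk of real data
    n += chunk.size
    total += chunk.sum()
    total_sq += np.square(chunk).sum()

mean = total / n
var = total_sq / n - mean**2
mu_hat, sigma_hat = lognormal_from_moments(mean, var)
print(mu_hat, sigma_hat)
```

Two caveats: the sample variance of a heavy-tailed lognormal is noisy, so these estimates can be noticeably off, and the moments have to come from the raw values rather than from the histogram, because the histogram discards everything outside the bin range.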
