
I would like to make a lognormal fit to my already binned data. The bar plot looks like this: [bar plot of the binned data]

Unfortunately, when I fit with the standard lognorm.fit(), the shape of the fitted distribution is very different from the data's. I guess that's because my data is already binned. Here's the code:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

times, data, bin_points = ReadHistogramFile(filename)

xmin = 200
xmax = 800
x = np.linspace(xmin, xmax, 1000)

# fit() treats the bin counts as if they were raw samples -- this is the problem
shape, loc, scale = stats.lognorm.fit(data, floc=0)
pdf = stats.lognorm.pdf(x, shape, loc=loc, scale=scale)

area = data.sum()
plt.bar(bin_points, data, width=10, color='b')
plt.plot(x * area, pdf, 'k')  # note: this stretches the x-axis instead of scaling the curve

Here's what the fitted distribution looks like: [the fitted curve overlaid on the data] Obviously there's something wrong with the scaling as well, but I'm less concerned about that. My main issue is the shape of the distribution. This might be a duplicate of this question, but I could not find a correct solution there; I tried it and still get a very similar shape as above. Thanks for any help!

Update: By using curve_fit() I was able to get somewhat of a fit, but I'm not satisfied yet. I'd like to use the original bins rather than unity bins, I'm not sure exactly what is happening, and there may be a better fit. Here's the code:

from scipy.optimize import curve_fit

def normalize_integral(data, bin_size):
    # normalize the histogram so that its integral (sum * bin width) equals 1
    integral = bin_size * data.sum()
    normalized_data = data / integral
    print('integral:', normalized_data.sum() * bin_size)
    return normalized_data



def pdf(x, mu, sigma):
    """pdf of the lognormal distribution"""
    return (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))
            / (x * sigma * np.sqrt(2 * np.pi)))


bin_points = np.linspace(280.5, 1099.55994, len(bin_points))
data = np.array([9.7820e+03, 1.1512e+04, 1.1800e+04, 1.7962e+04, 2.7698e+04,
                 2.7826e+04, 3.3546e+04, 3.2426e+04, 3.1650e+04, 3.3082e+04,
                 4.8456e+04, 5.8650e+04, 6.3422e+04, 5.1188e+04, 5.1318e+04,
                 4.7432e+04, 4.3542e+04, 4.1340e+04, 3.6088e+04, 2.9690e+04,
                 2.6664e+04, 2.5872e+04, 2.5756e+04, 2.2096e+04, 1.4688e+04,
                 9.9720e+03, 5.7420e+03, 3.5200e+03, 2.7460e+03, 2.6180e+03,
                 1.5000e+03, 7.9600e+02, 5.4000e+02, 2.9800e+02, 2.9000e+02,
                 2.2200e+02, 2.2600e+02, 1.8800e+02, 1.2000e+02, 5.0000e+01,
                 5.4000e+01, 5.8000e+01, 5.2000e+01, 2.0000e+01, 2.8000e+01,
                 6.0000e+00] + [0.0] * 28)
normalized_data_unitybins = normalize_integral(data,1)


plt.figure(figsize=(9,4))
ax1=plt.subplot(121)
ax2=plt.subplot(122)
# unity_bins was not defined in the original post; unit-width bins starting
# at 1 (so that log(x) is defined) are assumed here
unity_bins = np.arange(1, len(data) + 1)

ax2.bar(unity_bins, normalized_data_unitybins, width=1, color='b')
fitParams, fitCov = curve_fit(pdf, unity_bins, normalized_data_unitybins,
                              p0=[1, 1], maxfev=1000000)
fitData = pdf(unity_bins, *fitParams)
ax2.plot(unity_bins, fitData, 'g-')

ax1.bar(bin_points, normalized_data_unitybins, width=10, color='b')
fitParams, fitCov = curve_fit(pdf, bin_points, normalized_data_unitybins,
                              p0=[1, 1], maxfev=1000000)
fitData = pdf(bin_points, *fitParams)
ax1.plot(bin_points, fitData, 'g-')

[resulting plots: fit on original bins (left) and unity bins (right)]
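For what it's worth, the unity-bin detour shouldn't be necessary: curve_fit() can converge on the original axis too if the starting values are on the right scale. A minimal sketch under that assumption (the p0 values below are illustrative guesses, not from the original post):

# fit directly on the original bin centers, normalizing with the real bin width
bin_width = bin_points[1] - bin_points[0]
normalized_data = normalize_integral(data, bin_width)

# starting values on the right scale: mu near log(peak position), modest sigma
fitParams, fitCov = curve_fit(pdf, bin_points, normalized_data,
                              p0=[np.log(400), 0.3], maxfev=100000)

plt.bar(bin_points, normalized_data, width=bin_width, color='b')
plt.plot(bin_points, pdf(bin_points, *fitParams), 'g-')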

asked by aces
1 Answer

As you mention, you cannot use lognorm.fit on the binned data. So all you need to do is restore the raw data from the histogram. Obviously this is not 'lossless': the more bins, the better.

Sample code with some generated data:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt


# generate some data
ln = stats.lognorm(0.4,scale=100)
data = ln.rvs(size=2000)

counts, bins, _ = plt.hist(data, bins=50)
# note that the len of bins is 51, since it contains upper and lower limit of every bin

# restore data from the histogram: repeat each bin center by its count
restored = [[d]*int(counts[n]) for n,d in enumerate((bins[1:]+bins[:-1])/2)]
# flatten the result
restored = [item for sublist in restored for item in sublist]

print(stats.lognorm.fit(restored, floc=0))

dist = stats.lognorm(*stats.lognorm.fit(restored, floc=0))
x = np.arange(1,400)
y = dist.pdf(x)

# the pdf is normalized, so we need to scale it to match the histogram
y = y/y.max()
y = y*counts.max()

plt.plot(x,y,'r',linewidth=2)
plt.show()
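Two side notes on the snippet above (equivalent variants, not part of the original answer): the restoration can be written with np.repeat, and the curve can be scaled by total area instead of by peak height:

# same restoration in one call: repeat each bin center by its count
centers = (bins[1:] + bins[:-1]) / 2
restored = np.repeat(centers, counts.astype(int))

# area-based scaling: pdf * (number of samples) * (bin width)
bin_width = bins[1] - bins[0]
plt.plot(x, dist.pdf(x) * counts.sum() * bin_width, 'r', linewidth=2)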

[fitted histogram]

answered by Christian K.
  • I can't provide a response in Python, but this post was linked from stats.stackexchange.com, and my statistical perspective is that this approach is biased and won't work. The distribution's shape within any histogram bin is asymmetric, so imputing the midpoint doesn't work. The computational approach to follow is the EM algorithm: alternate between updating the parameter estimates and imputing the responses within each bin from quantiles of the fitted distribution for the sample size $n$, until convergence (maximum likelihood). I can show it in R. – AdamO Feb 15 '23 at 05:44
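For readers who want to act on AdamO's point without leaving Python: rather than full EM, the same maximum-likelihood result can be reached by directly maximizing the multinomial likelihood of the bin counts, where each bin's probability is a difference of lognormal CDF values. A minimal sketch, with illustrative generated data (not the OP's) and assumed starting values:

import numpy as np
from scipy import stats, optimize

def binned_nll(params, edges, counts):
    # negative log-likelihood of the bin counts under a lognormal(shape, scale);
    # parameters are optimized in log-space so both stay positive
    shape, scale = np.exp(params)
    cdf = stats.lognorm.cdf(edges, shape, loc=0, scale=scale)
    p = np.clip(np.diff(cdf), 1e-12, None)  # guard against empty tail bins
    return -np.sum(counts * np.log(p))

# illustrative data: histogram of lognormal samples (an assumption, mirroring
# the generated data in the answer above)
samples = stats.lognorm(0.4, scale=100).rvs(size=2000)
counts, edges = np.histogram(samples, bins=50)

res = optimize.minimize(binned_nll, x0=np.log([0.3, 90.0]),
                        args=(edges, counts), method='Nelder-Mead')
shape_hat, scale_hat = np.exp(res.x)
print(shape_hat, scale_hat)

This avoids the midpoint-imputation bias the comment describes, because the likelihood uses the exact probability mass of each bin.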