0

I have generated random data using:

bkg= 240-140*np.random.power(3.5,50000)

I plotted the points into a histogram by using

h_all = plt.hist(all,bins=binedges,histtype='step')

My question is, provided that I know the pdf (in this case called "bkg") can I generate a curve using scipy.optimize that fits the points generated perfectly, and what equation it is for the curve ?

3kt
  • 2,543
  • 1
  • 17
  • 29
James J
  • 131
  • 8

1 Answers1

1

First of all, remark that your bkg is NOT a probability density function (pdf). Rather, it is a list of observations from a pdf. By calling matplotlib.pyplot.hist on this list of observations, you get to see a curve that approximates the (offset and scaled version of the) probability density function. If you are given this curve, it is possible to get a good estimation of the parameters needed to model this, provided you've been given the parameterized model a priori.

For example:

import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

offset, scale, a, nsamples = 240, -140, 3.5, 500000
bkg = offset + scale*np.random.power(a, nsamples)  # values range between (offset, offset+scale), which map to 0 and 1
nbins = 100

count, bins, ignored = plt.hist(bkg, bins=nbins, histtype='stepfilled', edgecolor='none')

reversed probability density function of the power function distribution

If now you are given the centers of these bins and the counts,

xdata = .5*(bins[1:]+bins[:-1])
ydata = count

and you are asked to find the parameters of the power distribution function that fits to this data (-> someone told you this, you trust that source), then you could go about in the following manner.

First, observe that the power distribution function P(x,a) is a monotonously increasing function (i.e. P(x1, a ) < P(x2, a) when 0 <= x1 < x2 <= 1). That means that the dataset given above has been flipped left-to-right, or that it represents factor*P(x, a ) with factor < 0.

Next, notice that the given data is not given over the interval [0,1], typical for a probability density function. That means that you should rescale the given xdata to the [0,1] interval prior to attempting to fit the power function distribution to it. Just by observing the graph, you figure out that the values that 0 and 1 map to are 100 and 240. However, this is just luck here, because matplotlib chose a sensible range for plotting. When you are confronted with not actually knowing the limits to which 0 and 1 have mapped to, you could choose the less optimal (but still very good) choice of xdata[0] - binwidth/2 and xdata[-1] + binwidth/2 or (a slightly worse choice) xdata[0] and xdata[-1]. From the previous paragraph, you know that 1 maps to xdata[0] - binwidth/2 :=: a and 0 maps to xdata[-1] + binwidth/2 :=: b. The linear map that does this is lambda x: (a - b)*x + b (simple algebra).

If you pass this to [0,1]-mapped version of the xdata to curve_fit, it'll give you a good guess for the exponent.

def get_model(nobservations, binwidth, scale, offset):
    def model(bin_centers, exponent):
        x = (bin_centers - offset)/scale
        y = exponent*x**(exponent - 1)
        normed_y = nobservations * binwidth * y / np.abs(scale)
        return normed_y
    return model

binwidth = np.diff(xdata)[0]
p0, _ = curve_fit(get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2), xdata, ydata)
print(p0)  # prints e.g.: 3.37117679

plt.plot(xdata, get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2)(xdata, *p0))

At this moment, you have found a rather accurate description of the distribution that was used to generate the observations of bkg:

f(x) = offset + scale*(exponent * x**(exponent - 1))
     = (xdata[-1] + binwidth/2) + (-xdata.ptp() - binwidth)*(p0[0] * x**(p0[0] - 1))
     ~ 234.85 - 1.34.85*(3.37 * x**(3.37 - 1))

By the way, I'd like to point out that replicating bkg (the observations from the distribution) as a perfect copy is something you can only do if you know the exact parameters of the distribution (240, -140 and 3.5) AND set the seed for the random number generation equal to the seed that was in effect prior to the initial call to np.random.power.

fit to reversed probability density function of the power function distribution

If you'd like to fit a curve to the histogram using splines, you should retrieve the knots and coefficients from the generated spline and pass those into the function of bspleval, as shown here. The topic of writing out those equations is a long one however, and there are numerous resources on the internet that you can check to understand how it's done. Needless to say, that function bspleval is what you'll need in case you want to go that route. If it were me, I'd go the route of curve fitting shown above.

Community
  • 1
  • 1
Oliver W.
  • 13,169
  • 3
  • 37
  • 50
  • I managed to generate a interpolated curve using spline. But I could not find the specification to get a simple spline function back. Also could you specify in this case how the offset power distribution should look like? I have tried using (1-x)**2.5 but that didnt work. – James J Jun 03 '16 at 15:04
  • @JamesJ, I've elaborated the approach of curve fitting using an offset and scaled distribution. I hope it's much clearer now. – Oliver W. Jun 07 '16 at 09:32