2

Estimating the probability density function using the histogram by generating some random data. I want now two versions of histogram namely the equal bin width and equal bin height histograms.

# -*- coding: utf-8 -*-
from scipy.stats import norm
import matplotlib.pyplot as plt
#import pandas as pd
import numpy as np
fig, ax = plt.subplots(1, 1)

#Calculate a few first moments:
mean, var, skew, kurt = norm.stats(moments='mvsk')

#Display the probability density function (pdf):
x = np.linspace(norm.ppf(0.01),
                norm.ppf(0.99), 100)
ax.plot(x, norm.pdf(x),
       'r-', lw=5, alpha=0.6, label='norm pdf')

#Freeze the distribution and display the frozen pdf:
rv = norm()
ax.plot(x, rv.pdf(x), 'b-', lw=2, label='frozen pdf')

#Check accuracy of cdf and ppf:
vals = norm.ppf([0.001, 0.5, 0.999])
np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))

#Generate random numbers:
r = norm.rvs(size=10000)

#df = pd.read_excel('ardata.xlsx')
#r = df[['dest','source']].values


#And compare the histogram:
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()

2 Answers2

0

If you want to generate histograms with equal bin width and bin height you cannot use random samples of the normal distribution (see the documentation of the rvs function). To achieve the desired goal, you need to take deterministic samples from the distribution. You could do for example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


fig, ax = plt.subplots(1, 1)

# Display the probability density function (pdf):

xppf = np.linspace(norm.ppf(0.01),
                   norm.ppf(0.99), 100000)

ax.plot(xppf, norm.pdf(xppf, loc=0),
        'r-', lw=3, alpha=0.6, label='norm pdf')

# Create histogram:

mybins = np.linspace(norm.ppf(0.01), norm.ppf(0.99), num=12)  # Evenly spaced bins
myvals = np.linspace(0.01, 0.99, 100000)

ax.hist(norm.ppf(myvals, loc=0), bins=mybins, density=True,
        histtype='stepfilled', alpha=0.2)

ax.legend(loc='best', frameon=False)
plt.xlabel(r'x')
plt.ylabel(r'PDF(x)')
plt.show()

Which plots:

enter image description here

The obtained histogram will have evenly spaced bins (the 12 bins in the example are set with a linspace) and they will also have the same height because the sampling was deterministic (also a consequence of the use of the linspace).

panadestein
  • 1,241
  • 10
  • 21
  • I have samples in two columns in excel datasheet like: source and destination. how can I take P(x) value using the above code? – Humair Ali Palh Jun 12 '20 at 15:00
  • @HumairAliPalh glad to see that the answer helped you. You could parse the data in the datasheet with pandas [read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) function. Something like `pd.read_excel('datasheet.xlsx', index_col=0)` and then take the values in the columns of the data frame. The code you had commented was on the right track. – panadestein Jun 12 '20 at 16:58
  • As I already parse my data using pandas. I am confused only in one thing which is `PDF(x)` how can I get these values using PDF !! – Humair Ali Palh Jun 12 '20 at 21:41
  • @HumairAliPalh use the norm.pdf(xdata) method with your data to get the values of the Probability Density Function. To get the height of the histogram bars, use the Percent Point Function (Quantile Function) `norm.ppf` – panadestein Jun 12 '20 at 22:29
0

From a given array r of samples, you could create an "even height histogram" as follows:

  • Sort the values.
  • Divide the sorted array into equal parts, e.g. 10 parts.
  • Use the values corresponding to these indices as the delimiters of some bars.
  • To get a normalized area of 1, the height times the width should be 1. As the width is just the range from the first to the last of the sorted elements, the height should be its inverse.

Running this multiple times results in somewhat different plots, as the random smallest and largest value will vary quite a bit.

s = np.sort(r)
bins = 10
ind = np.arange(bins + 1) * (s.size - 1) // bins
ax.bar(s[ind][:-1], 1/(s[-1] - s[0]), width=np.diff(s[ind]),
       color='g', alpha=0.4, ec='k', align='edge', zorder=-1, label='equal heights hist')

resulting plot

JohanC
  • 71,591
  • 8
  • 33
  • 66