I have a number of samples of a variable, and I would like to use them to plot the variable's probability distribution. I'm using kernel density estimation with a Gaussian kernel, via the sklearn library. Here is the sample code I have implemented:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# -- data
init_range = 0.0793
X = np.random.uniform(low=-init_range, high=init_range, size=133280)[:, np.newaxis]
# -- kernel density estimation
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)
X_plot = np.linspace(X.min(), X.max(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
# -- plot density
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.ylim([0, 2.1])
plt.show()
Below is the resulting output:
As you can see, the values on the y-axis go above one, so the y-axis is NOT showing probabilities. I also plotted a histogram of the same data:
# -- plot hist
n_bins = 40
weights = np.ones_like(X) / float(len(X))
prob, bins, _ = plt.hist(X, n_bins, density=False, histtype='step', color='red', weights=weights)
plt.show()
and the result is below:
This makes sense, since the bin heights sum to one: 0.025 * 40 = 1.
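To make that arithmetic concrete, here is a minimal sketch that recomputes the bin heights with `np.histogram` instead of `plt.hist`, and also divides them by the bin width for comparison. The fixed seed and the names `prob`, `widths`, and `density` are assumptions added for illustration; they are not part of the code above.

```python
import numpy as np

# Same setup as above; the seed is an assumption so the run is repeatable.
rng = np.random.default_rng(0)
init_range = 0.0793
X = rng.uniform(low=-init_range, high=init_range, size=133280)

n_bins = 40
counts, edges = np.histogram(X, bins=n_bins)

# Fraction of samples per bin: this is what weights = 1 / len(X) plots.
prob = counts / counts.sum()

# The same fractions divided by the bin width: a per-unit-length density.
widths = np.diff(edges)
density = prob / widths

print("sum of bin heights (mass):", prob.sum())
print("sum of density * width:", (density * widths).sum())
print("largest density value:", density.max())
```

Because each bin is much narrower than one unit, dividing by the width gives values well above one even though the per-bin fractions still sum to one.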
I'm having a hard time understanding why my KDE plot is not a probability distribution. How can I fix this? Is there a normalization step that I'm missing?
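One check that may help narrow this down is to look at the area under the estimated curve rather than its height. The sketch below is only an illustration under a few assumptions of mine: a smaller sample (5000 points) and a fixed seed to keep it fast, an evaluation grid of [-2, 2] chosen to cover the Gaussian tails at bandwidth 0.2, and a plain Riemann sum as the integration rule.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Smaller sample and fixed seed are assumptions, chosen only for speed.
rng = np.random.default_rng(0)
init_range = 0.0793
X = rng.uniform(low=-init_range, high=init_range, size=5000)[:, np.newaxis]

kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)

# Evaluate well beyond the data range, since the kernel tails extend
# roughly ten bandwidths on either side of the samples here.
grid = np.linspace(-2.0, 2.0, 4001)
dens = np.exp(kde.score_samples(grid[:, np.newaxis]))

# Riemann-sum approximation of the area under the estimated curve.
dx = grid[1] - grid[0]
area = float(np.sum(dens) * dx)
peak = float(dens.max())

print(f"peak height: {peak:.3f}")
print(f"area under curve: {area:.3f}")
```

If the area comes out close to one while the peak is above one, the curve is a density (probability per unit of x) rather than a per-bin probability, which would explain the difference from the histogram.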