I want to plot an approximation of the probability density function based on a sample that I have: a curve that mimics the histogram's behaviour. I can make the sample as large as I want.
30
-
What is your sample? Is it a distribution, or actual data? – askewchan Mar 14 '13 at 16:58
-
I don't understand how could somebody vote down this question?! I mean based on what??? – Cupitor Mar 15 '13 at 14:27
-
usually on [SO] people will upvote questions that are immediately clear and also show some attempt by the asker to answer their own question. "What have you tried?" Usually downvotes are accompanied by comments though, so I'm not sure why that didn't happen in this case. – askewchan Mar 15 '13 at 15:27
-
I see. Thanks for explanation... Sometimes these things make me think democracy sucks! – Cupitor Mar 15 '13 at 15:53
-
heh, yeah. the [faq] are pretty useful for outlining what people expect to be (and not to be) in a question. And aside from 'reputation' more upvotes will make your questions get more visibility and attention. – askewchan Mar 15 '13 at 16:03
-
thanks. I will try to read it :) That is also true! I will try to be more clear the next time! – Cupitor Mar 15 '13 at 16:07
-
solved my problem – Toma Jul 05 '21 at 02:19
2 Answers
43
If you know the distribution you want to plot, define it as a function and plot it like so:
import numpy as np
from matplotlib import pyplot as plt
def my_dist(x):
    return np.exp(-x ** 2)
x = np.linspace(-5, 5, 500)  # fine grid; exp(-x**2) is negligible outside this range
p = my_dist(x)
plt.plot(x, p)
plt.show()
If you don't have the exact distribution as an analytical function, perhaps you can generate a large sample, take a histogram and somehow smooth the data:
import numpy as np
from scipy.interpolate import UnivariateSpline
from matplotlib import pyplot as plt
N = 1000
n = N//10
s = np.random.normal(size=N) # generate your data sample with N elements
p, x = np.histogram(s, bins=n) # bin it into n = N//10 bins
x = x[:-1] + (x[1] - x[0])/2 # convert bin edges to centers
f = UnivariateSpline(x, p, s=n)
plt.plot(x, f(x))
plt.show()
You can increase or decrease `s` (the smoothing factor) within the `UnivariateSpline` function call to increase or decrease smoothing. For example, using the two you get:
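To make the effect of `s` concrete, here is a minimal sketch that fits the same histogram with two smoothing factors. The helper name `smoothed_histogram`, the fixed seed, and the two values of `s` are illustrative assumptions, not part of the original answer. Note that the curve follows raw bin counts, as in the code above, rather than a normalized density.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def smoothed_histogram(sample, n_bins, smoothing):
    """Return bin centers and a spline fitted to the histogram counts.

    Hypothetical helper for illustration only.
    """
    counts, edges = np.histogram(sample, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2  # midpoint of each bin
    spline = UnivariateSpline(centers, counts, s=smoothing)
    return centers, spline


rng = np.random.default_rng(0)  # fixed seed (assumption) so runs are repeatable
sample = rng.normal(size=1000)
n_bins = 100

# Illustrative smoothing factors: a larger s lets the spline deviate more
# from the bin counts, which gives a smoother curve.
centers, f_light = smoothed_histogram(sample, n_bins, smoothing=n_bins)
_, f_heavy = smoothed_histogram(sample, n_bins, smoothing=20 * n_bins)

curve_light = f_light(centers)
curve_heavy = f_heavy(centers)
```

Plotting `curve_light` and `curve_heavy` against `centers` with `plt.plot` should show the second curve following the bin-to-bin noise less closely.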

askewchan
-
that doesn't help in my case. I already wrote my sampling function and it is not exact for samples of size one lets say! – Cupitor Mar 14 '13 at 17:04
-
Then I think you should edit your question to be more clear. This answers your question assuming you "have the distribution". – askewchan Mar 14 '13 at 17:05
-
Thank you. But I get the following error: raise ValueError("x and y arrays must be equal in length along " ValueError: x and y arrays must be equal in length along interpolation axis. – Cupitor Mar 14 '13 at 17:14
-
@Naji Sorry about that, it should work now, with a working example of a normal distribution. – askewchan Mar 14 '13 at 17:30
-
I still get the following error: f = UnivariateSpline(x, 0.5, s=200) File "/Library/Python/2.7/site-packages/scipy/interpolate/fitpack2.py", line 143, in __init__ xb=bbox[0],xe=bbox[1],s=s) dfitpack.error: failed in converting 2nd argument `y' of dfitpack.fpcurf0 to C/Fortran array – Cupitor Mar 14 '13 at 17:45
-
`UnivariateSpline` takes two lists or arrays, `x` and `y` which must have the same shape. You've given it `x` and `0.5`, so they're not the same shape. I've used `p` and `x` where `p` is the probability of finding `x` (plus or minus dx). `p` is basically your histogram height, or probability distribution, which you said you could generate. – askewchan Mar 14 '13 at 17:49
-
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/26182/discussion-between-askewchan-and-naji) – askewchan Mar 14 '13 at 17:50
-
Good point @Ajay, I should update this! When I wrote this five years ago, `n` was an `int` because I was using python 2, and most of the audience probably was too. – askewchan Feb 20 '18 at 01:23
29
What you need is `gaussian_kde` from `scipy.stats`.
Given your data, you can do something like this:
from scipy.stats import gaussian_kde
from numpy import linspace
from numpy.random import randn
from matplotlib import pyplot as plt
# create fake data
data = randn(1000)
# this creates the kernel; given an array, it estimates the density over those values
kde = gaussian_kde( data )
# these are the values over which your kernel will be evaluated
dist_space = linspace( min(data), max(data), 100 )
# plot the results
plt.plot( dist_space, kde(dist_space) )
plt.show()
The kernel density can be configured at will and can handle N-dimensional data with ease. It will also avoid the spline distortion that you can see in the plot given by askewchan.
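As a rough sketch of what "configured at will" can look like, the `bw_method` argument of `gaussian_kde` controls the bandwidth, and multi-dimensional data is passed as an array of shape `(n_dimensions, n_points)`. The seed, the sample sizes, and the bandwidth value `0.1` below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # fixed seed (assumption) for repeatability
data = rng.normal(size=1000)

# Default bandwidth (Scott's rule) versus two alternatives.
kde_default = gaussian_kde(data)
kde_narrow = gaussian_kde(data, bw_method=0.1)  # scalar is used as the bandwidth factor
kde_silverman = gaussian_kde(data, bw_method="silverman")

grid = np.linspace(data.min(), data.max(), 200)
density_default = kde_default(grid)
density_narrow = kde_narrow(grid)  # narrower kernel, so a wigglier curve

# 2-D data: one row per dimension, one column per observation.
points_2d = rng.normal(size=(2, 500))
kde_2d = gaussian_kde(points_2d)
density_at_origin = kde_2d(np.array([[0.0], [0.0]]))  # evaluate at the point (0, 0)
```

A smaller bandwidth factor tracks the sample more closely at the cost of a noisier curve, so it may be worth plotting a few values against the histogram before settling on one.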

Alessandro Jacopson

EnricoGiampieri
-
I am looking for a similar solution. I have a data-set already but I do not know what distribution does it have so I am trying to plot a Probability distribution function using python and I dont happen to know how to plot that. Any help is appreciated in that case. – Sitz Blogz Mar 16 '16 at 06:44
-
@SitzBlogz Let's say your data-set is called `data`, then just remove the line `data = randn(1000)` in @EnricoGiampieri answer and you're done! – Alessandro Jacopson Aug 04 '16 at 10:09