Plotting probability density function by sample with matplotlib

Question

I want to plot an approximation of a probability density function based on a sample that I have: a curve that mimics the histogram's behaviour. I can draw samples as large as I want.

I don't understand how could somebody vote down this question?! I mean based on what??? — Cupitor, Mar 15 '13 at 14:27
usually on [SO] people will upvote questions that are immediately clear and also show some attempt by the asker to answer their own question. "What have you tried?" Usually downvotes are accompanied by comments though, so I'm not sure why that didn't happen in this case. — askewchan, Mar 15 '13 at 15:27
I see. Thanks for explanation... Sometimes these things make me think democracy sucks! — Cupitor, Mar 15 '13 at 15:53
heh, yeah. the [faq] are pretty useful for outlining what people expect to be (and not to be) in a question. And aside from 'reputation' more upvotes will make your questions get more visibility and attention. — askewchan, Mar 15 '13 at 16:03
thanks. I will try to read it :) That is also true! I will try to be more clear the next time! — Cupitor, Mar 15 '13 at 16:07

askewchan · Accepted Answer · 2018-02-20T01:25:41.617

43

If you want to plot a distribution and you know it, define it as a function and plot it like so:

import numpy as np
from matplotlib import pyplot as plt

def my_dist(x):
    return np.exp(-x ** 2)

x = np.linspace(-5, 5, 201)   # fine grid over the region where the curve is non-negligible
p = my_dist(x)
plt.plot(x, p)
plt.show()
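
Note that exp(-x ** 2) as written is not normalized, so it is a density only up to a constant factor. A minimal sketch, assuming you want the area under the curve to equal 1: estimate the integral numerically and rescale. The helper name normalize_on_grid and the grid limits are illustrative choices, not part of the original answer.

```python
import numpy as np

def my_dist(x):
    # Unnormalized Gaussian shape, as in the answer
    return np.exp(-x ** 2)

def normalize_on_grid(func, x):
    # Hypothetical helper: approximate the integral with the trapezoidal
    # rule, then rescale so the curve integrates to 1 on this grid.
    y = func(x)
    area = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
    return y / area, area

x = np.linspace(-5, 5, 1001)
pdf, area = normalize_on_grid(my_dist, x)
print(area)  # close to sqrt(pi), the exact integral of exp(-x**2)
```

The rescaled pdf can be passed to plt.plot(x, pdf) exactly like p in the answer's code.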

If you don't have the exact distribution as an analytical function, perhaps you can generate a large sample, take a histogram and somehow smooth the data:

import numpy as np
from scipy.interpolate import UnivariateSpline
from matplotlib import pyplot as plt

N = 1000
n = N//10
s = np.random.normal(size=N)   # generate your data sample with N elements
p, x = np.histogram(s, bins=n) # bin it into n = N//10 bins
x = x[:-1] + (x[1] - x[0])/2   # convert bin edges to centers
f = UnivariateSpline(x, p, s=n) # fit a smoothing spline to the bin counts
plt.plot(x, f(x))
plt.show()

You can increase or decrease s (the smoothing factor) in the UnivariateSpline call to increase or decrease smoothing. For example, using the two you get: [image: dist to func]
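
To get a feel for what s does, here is a minimal sketch that fits the same histogram with three smoothing factors and reports how many knots the spline uses and how far it ends up from the bin counts. The seed and the particular values of s are assumptions chosen for illustration; only the setup (N, n, bin centers) follows the answer.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)          # fixed seed, assumed for repeatability
N = 1000
n = N // 10
sample = rng.normal(size=N)
counts, edges = np.histogram(sample, bins=n)
centers = (edges[:-1] + edges[1:]) / 2  # bin edges -> bin centers

results = {}
for s in (0, n, 10 * n):
    f = UnivariateSpline(centers, counts, s=s)
    # get_knots(): knot positions; get_residual(): sum of squared residuals
    results[s] = (len(f.get_knots()), f.get_residual())
    print(s, results[s])
```

With s=0 the spline interpolates every bin count, noise included. Larger values let the fit drift further from the counts and use fewer knots, which is what gives the smoother curve.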

edited Feb 20 '18 at 01:25

answered Mar 14 '13 at 17:01

askewchan


that doesn't help in my case. I already wrote my sampling function and it is not exact for samples of size one lets say! – Cupitor Mar 14 '13 at 17:04
Then I think you should edit your question to be more clear. This answers your question assuming you "have the distribution". – askewchan Mar 14 '13 at 17:05
Thank you. But I get the following error: raise ValueError("x and y arrays must be equal in length along " ValueError: x and y arrays must be equal in length along interpolation axis. – Cupitor Mar 14 '13 at 17:14
@Naji Sorry about that, it should work now, with a working example of a normal distribution. – askewchan Mar 14 '13 at 17:30
I still get the following error: f = UnivariateSpline(x, 0.5, s=200) File "/Library/Python/2.7/site-packages/scipy/interpolate/fitpack2.py", line 143, in __init__ xb=bbox[0],xe=bbox[1],s=s) dfitpack.error: failed in converting 2nd argument `y' of dfitpack.fpcurf0 to C/Fortran array – Cupitor Mar 14 '13 at 17:45
`UnivariateSpline` takes two lists or arrays, `x` and `y` which must have the same shape. You've given it `x` and `0.5`, so they're not the same shape. I've used `p` and `x` where `p` is the probability of finding `x` (plus or minus dx). `p` is basically your histogram height, or probability distribution, which you said you could generate. – askewchan Mar 14 '13 at 17:49
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/26182/discussion-between-askewchan-and-naji) – askewchan Mar 14 '13 at 17:50
you should use n =int( N/10) to avoid error from float type – Ajay Ohri Feb 19 '18 at 08:51
Good point @Ajay, I should update this! When I wrote this five years ago, `n` was an `int` because I was using python 2, and most of the audience probably was too. – askewchan Feb 20 '18 at 01:23
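
The length mismatch discussed in the comments above is easy to reproduce: np.histogram returns one more bin edge than it returns counts, so the edges have to be converted to centers before they can be paired with the counts. A minimal sketch, with the seed and bin count chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, assumed for repeatability
sample = rng.normal(size=1000)

counts, edges = np.histogram(sample, bins=100)
print(counts.shape, edges.shape)        # (100,) (101,)

# Midpoint of each bin: same length as counts, so safe to pass as x
centers = (edges[:-1] + edges[1:]) / 2
print(centers.shape)                    # (100,)
```

For equal-width bins this gives the same centers as the answer's x[:-1] + (x[1] - x[0])/2.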

score 29 · Answer 2 · edited Aug 04 '16 at 10:06

What you have to do is use gaussian_kde from the scipy.stats package.

Given your data, you can do something like this:

from scipy.stats import gaussian_kde
from numpy import linspace
from numpy.random import randn
from matplotlib import pyplot as plt
# create fake data
data = randn(1000)
# this creates the kernel; given an array, it estimates the probability density over those values
kde = gaussian_kde(data)
# these are the values over which your kernel will be evaluated
dist_space = linspace(min(data), max(data), 100)
# plot the results
plt.plot(dist_space, kde(dist_space))
plt.show()

The kernel density estimate can be configured at will and handles N-dimensional data with ease. It also avoids the spline distortion that you can see in the plot given by askewchan.
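
As a rough illustration of "configured at will", the sketch below changes the bandwidth through the bw_method argument and builds a 2-D estimate. The seed, the value 0.1, and the variable names are assumptions made for the example, not recommendations.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)                  # fixed seed, assumed for repeatability
data = rng.normal(size=1000)

kde_default = gaussian_kde(data)                # bandwidth from Scott's rule (the default)
kde_narrow = gaussian_kde(data, bw_method=0.1)  # a scalar sets the bandwidth factor directly

grid = np.linspace(data.min(), data.max(), 100)
density_default = kde_default(grid)             # smoother curve
density_narrow = kde_narrow(grid)               # follows the sample more closely

# The estimate is a proper density: it integrates to 1 over the real line
total = kde_default.integrate_box_1d(-np.inf, np.inf)

# N-dimensional data is passed with shape (n_dims, n_points)
xy = rng.normal(size=(2, 500))
kde_2d = gaussian_kde(xy)
density_2d = kde_2d(xy[:, :5])                  # density at the first five points
```

Either 1-D curve can be drawn with plt.plot(grid, density_default). A smaller bandwidth factor gives a wigglier estimate, a larger one a smoother estimate.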


edited Aug 04 '16 at 10:06

Alessandro Jacopson


answered Mar 14 '13 at 19:39

EnricoGiampieri


I am looking for a similar solution. I have a data-set already but I do not know what distribution does it have so I am trying to plot a Probability distribution function using python and I dont happen to know how to plot that. Any help is appreciated in that case. – Sitz Blogz Mar 16 '16 at 06:44
@SitzBlogz Let's say your data-set is called `data`, then just remove the line `data = randn(1000)` in @EnricoGiampieri answer and you're done! – Alessandro Jacopson Aug 04 '16 at 10:09
