I have a set of data values, and I want to get the CDF (cumulative distribution function) for that data set.
Since this is a continuous variable, we can't use binning approach as mentioned in (How to get cumulative distribution function correctly for my data in python?). So I came up with following approach.
import scipy.stats as st
def trapezoidal_2(ag, a, b, n):
h = np.float(b - a) / n
s = 0.0
s += ag(a)[0]/2.0
for i in range(1, n):
s += ag(a + i*h)[0]
s += ag(b)[0]/2.0
return s * h
def get_cdf(data):
a = np.array(data)
ag = st.gaussian_kde(a)
cdf = [0]
x = []
k = 0
max_data = max(data)
while (k < max_data):
x.append(k)
k = k + 1
sum_integral = 0
for i in range(1, len(x)):
sum_integral = sum_integral + (trapezoidal_2(ag, x[i - 1], x[i], 2))
cdf.append(sum_integral)
return x, cdf
This is how I use this method.
b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data) x_cdf, y_cdf = get_cdf(data)
Ideally I should get a value close to 1 at the end of y_cdf list. But I get a value close to 0.57.
What is going wrong here? Is my approach correct?
Thanks.