0

I have a set of data values, and I want to get the CDF (cumulative distribution function) for that data set.

Since this is a continuous variable, we can't use binning approach as mentioned in (How to get cumulative distribution function correctly for my data in python?). So I came up with following approach.

import scipy.stats as st

def trapezoidal_2(ag, a, b, n):
    h = np.float(b - a) / n
    s = 0.0
    s += ag(a)[0]/2.0
    for i in range(1, n):
        s += ag(a + i*h)[0]
    s += ag(b)[0]/2.0
    return s * h

def get_cdf(data):
    a = np.array(data)
    ag = st.gaussian_kde(a)

    cdf = [0]
    x = []
    k = 0

    max_data = max(data)

    while (k < max_data):
        x.append(k)
        k = k + 1

    sum_integral = 0

    for i in range(1, len(x)):
        sum_integral = sum_integral + (trapezoidal_2(ag, x[i - 1], x[i], 2))
        cdf.append(sum_integral)

    return x, cdf

This is how I use this method.

b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)    x_cdf, y_cdf = get_cdf(data)

Ideally I should get a value close to 1 at the end of y_cdf list. But I get a value close to 0.57.

What is going wrong here? Is my approach correct?

Thanks.

Pasindu Tennage
  • 1,480
  • 3
  • 14
  • 31

3 Answers3

3

The value of the cdf at x is the integral of the pdf between -inf and x, but you are computing it between 0 and x. Maybe you are assuming that the pdf is 0 for x < 0 but it is not:

rs = np.random.RandomState(seed=52221829)
b = 1
data = st.pareto.rvs(b, size=10000, random_state=rs)
ag = st.gaussian_kde(data)

x = np.linspace(-100, 100)
plt.plot(x, ag.pdf(x))

enter image description here

So this is probably what's going wrong here: you not checking your assumptions.

Your code for computing the integral is painfully slow, there are better ways to do this with scipy but gaussian_kde provides the method integrate_box_1d to integrate the pdf. If you take the integral from -inf everything looks right.

cdf = np.vectorize(lambda x: ag.integrate_box_1d(-np.inf, x))
plt.plot(x, cdf(x))

enter image description here

Integrating between 0 and x you get the same you are seeing now (to the right of 0), but that's not a cdf at all:

wrong_cdf = np.vectorize(lambda x: ag.integrate_box_1d(0, x))
plt.plot(x, wrong_cdf(x))

enter image description here

Stop harming Monica
  • 12,141
  • 1
  • 36
  • 56
  • that KDE doesn't look anything like a pareto distribution; there should be zero mass < 0. not sure if a gaussian KDE would ever work nicely with a pareto distribution, especially with alpha=1. try plotting `plt.plot(x, st.pareto.pdf(x, b))` on top of the estimate – Sam Mason Sep 10 '18 at 13:48
  • @SamMason You are right but I feel that I don't have enough context to frame-challenge the question or suggest improvements. It depends on the purpose and what can be safely assumed about the data. – Stop harming Monica Sep 10 '18 at 14:19
0

Not sure about why your function is not working exactly but one way of calculating CDF is as follows:

def get_cdf_1(data):

    # start with sorted list of data
    x = [i for i in sorted(data)]

    cdf = []

    for xs in x:
        # get the sum of the values less than each data point and store that value
        # this is normalised by the sum of all values
        cum_val = sum([i for i in data if i <= xs])/sum(data) 
        cdf.append(cum_val)

    return x, cdf

There is no doubt a faster way of computing this using numpy arrays rather than appending values to a list, but this returns values in the same format as your original example.

MattB
  • 520
  • 3
  • 12
  • This is correct for discrete data sets. For continuous data, we can't simply get the CDF like this – Pasindu Tennage Sep 09 '18 at 04:40
  • But in this case the data points are all known and therefore doesn't it become an approximation to a continuous data set by a discrete data set. In this case the values are all being used and no information is lots by 'binning' them. – MattB Dec 27 '19 at 11:30
0

I think it's just:

def get_cdf(data):
  return sorted(data), np.linspace(0, 1, len(data))

but I might be misinterpreting the question!

when I compare this to the analytic result I get the same:

x_cdf, y_cdf = get_cdf(st.pareto.rvs(1, size=10000))

import matplotlib.pyplot as plt
plt.semilogx(x_cdf, y_cdf)
plt.semilogx(x_cdf, st.pareto.cdf(x_cdf, 1))
Sam Mason
  • 15,216
  • 1
  • 41
  • 60
  • This is a discrete approximation of CDF. For continuous data, we can't use this approach I guess – Pasindu Tennage Sep 09 '18 at 04:42
  • 1
    it's the CDF of your (finite) set of samples, not an approximation. If you want an estimate of the underlying distribution you need to make assumptions about it, discrete/continuous, smoothness, finiteness of moments, etc. – Sam Mason Sep 10 '18 at 14:08
  • could you explain what you mean by "fail"? it certainly trades efficiency for simplicity, but I don't see how it would give an incorrect result (which is how I'm interpreting "fail") – Sam Mason Sep 14 '18 at 14:38