0

For context, I'm trying to understand how to use np.cumsum to replicate scipy.stats.expon.cdf because I need to write a function compatible with scipy.stats.kstest which is not one of the distributions already available in scipy.stats.

I am having issues with finding resources online which give good guides to numerically computing the CDF, because the majority of resources just point you to in-built methods. I've gotten as far as the custom_exponential_cdf function defined below, which works well in some cases but breaks down in others (for example, the one below).

Does anyone know what I am doing wrong, and how I can modify custom_exponential_cdf to match the output from scipy.stats.expon.cdf?

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import kstest, expon

def custom_exponential_cdf(x, lamb):
    x = x.copy()
    x[x < 0] = 0.0
    pdf = lamb * np.exp(-lamb * x)
    cdf = np.cumsum(pdf * np.diff(np.concatenate(([0], x))))
    return cdf

unique_values = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.85, 0.87, 0.89])
counts = np.array([1597, 1525, 1438, 1471, 1311, 1303, 1202, 1147, 1002, 918, 893, 801, 713, 680, 599, 578, 478, 430, 409, 353, 350, 292, 245, 211, 224, 182, 171, 151, 125, 111, 94, 85, 72, 73, 57, 36, 53, 35, 35, 27, 19, 20, 15, 10, 20, 12, 10, 13, 11, 10, 17, 15, 8, 3, 3, 3, 5, 6, 6, 2, 3, 3, 4, 6, 1, 1, 3, 1, 2, 1, 3, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2])
x = np.repeat(unique_values, counts)
lamb = 9.23

fig, ax = plt.subplots()
ax.plot(x, expon.cdf(x, 0.0, 1.0 / lamb))
ax.plot(x, custom_exponential_cdf(x, lamb))

print(kstest(x, custom_exponential_cdf, (lamb,)))
print(kstest(x, "expon", (0.0, 1.0 / lamb)))

The output from the print statements is

KstestResult(statistic=0.08740741955472273, pvalue=6.857709296861777e-145, statistic_location=0.02, statistic_sign=-1)
KstestResult(statistic=0.0988550670723162, pvalue=2.7098163860110364e-185, statistic_location=0.04, statistic_sign=-1)

The output from the plot is:

A graph illustrating the difference between my function and scipy.stats.expon.cdf

John Coxon
  • 570
  • 1
  • 3
  • 15
  • You could try looking at the source code for `expon`: https://github.com/scipy/scipy/blob/v1.10.1/scipy/stats/_continuous_distns.py#L1616-L1722 – Sash Sinha Mar 03 '23 at 12:42

1 Answers1

1

The pdf of the exponential distribution is 0 for x < 0 and the custom function needs to reflect this. Here is one way to fix it:

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import kstest, expon

def custom_exponential_cdf(x, lamb):
    x = x.copy()
    x[x < 0] = 0
    pdf = lamb * np.exp(-lamb * x)
    cdf = np.cumsum(pdf * np.diff(np.concatenate(([0], x))))
    return cdf

x = np.sort(np.random.normal(loc=0.1, scale=0.09, size=21729))
lamb = 9.23

cdf = custom_exponential_cdf(x, lamb)
fig, ax = plt.subplots()
ax.plot(x, expon.cdf(x, 0.0, 1.0 / lamb), lw=7, alpha = 0.7, label="scipy")
ax.plot(x, cdf, 'r', label="custom")
plt.legend()

It gives:

enter image description here

bb1
  • 7,174
  • 2
  • 8
  • 23
  • Thank you very much for this! I've edited the question above with the new function, but it still doesn't work for the issue I'm having in my actual code, so I've reworked my MWE with the actual distribution that I'm having issues with. Unfortunately this version still doesn't match the `scipy.stats.expon.cdf` output for the data I'm actually working with, and I still don't really understand why… – John Coxon Mar 03 '23 at 14:00
  • @JohnCoxon The custom cdf function you wrote uses simple numerical integration to approximate the cdf (the area below the pdf curve is approximately the sum areas of rectangles). Since this is only an approximation, there is an error involved. This error will accumulate for higher values of the `x` array resulting in a growing discrepany. The exponential distribution has a simple analytical formula for cdf: `cdf = 1 - np.exp(- lamb *x)`, and if you use it instead of `np.cumsum` you will get more accurate results. – bb1 Mar 03 '23 at 14:47
  • @JohnCoxon In the first array `x` you used (generated from the normal distribution) spacing of values of this array is small, resulting in a better approximation. In the second array, different values are 0.01 apart, and the computational error is bigger. – bb1 Mar 03 '23 at 14:52