
Let's say that we have two samples, data1 and data2, with their respective weights wei1 and wei2, and that we want to compute the Kolmogorov-Smirnov statistic between the two weighted samples.

We do that in Python as follows:

import numpy as np

def ks_w(data1, data2, wei1, wei2):
    # sort both samples and carry their weights along
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data1 = data1[ix1]
    data2 = data2[ix2]
    d = 0.
    fn1 = 0.
    fn2 = 0.
    j1 = 0
    j2 = 0
    j1w = 0.
    j2w = 0.
    # walk through both sorted samples at once, advancing the weighted
    # empirical CDFs instead of the plain counts of the classical routine
    while j1 < len(data1) and j2 < len(data2):
        d1 = data1[j1]
        d2 = data2[j2]
        w1 = wei1[j1]
        w2 = wei2[j2]
        if d1 <= d2:
            j1 += 1
            j1w += w1
            fn1 = j1w / sum(wei1)
        if d2 <= d1:
            j2 += 1
            j2w += w2
            fn2 = j2w / sum(wei2)
        if abs(fn2 - fn1) > d:
            d = abs(fn2 - fn1)
    return d
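
For instance, a quick call on synthetic data (the arrays below are made up purely for illustration):

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(loc=.3, size=120)
wa = rng.random(100) + 1.
wb = rng.random(120) + 1.
ks_w(a, b, wa, wb)  # weighted KS distance, a value in [0, 1]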

where we have simply adapted to our purpose the classical two-sample KS statistic as implemented in Press, Flannery, Teukolsky, Vetterling, Numerical Recipes in C, Cambridge University Press, 1992, p. 626.

Our questions are:

  • is anybody aware of any other way to do it?
  • is there any library in Python/R/* that performs it?
  • what about the test itself? Does a weighted version exist, or should we use a reshuffling procedure to evaluate the statistic? (A minimal sketch of what we mean by reshuffling follows below.)
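
By a reshuffling procedure we mean something along the following lines; this is only a minimal illustrative sketch reusing ks_w above (the helper name, the number of permutations, and the fixed seed are arbitrary):

def ks_w_perm_pvalue(data1, data2, wei1, wei2, n_perm=1000, seed=0):
    # permutation p-value: pool the observations (with their weights),
    # repeatedly split the pool at random into groups of the original
    # sizes, and count how often the permuted statistic reaches d_obs
    rng = np.random.default_rng(seed)
    d_obs = ks_w(data1, data2, wei1, wei2)
    pooled = np.concatenate([data1, data2])
    pooled_w = np.concatenate([wei1, wei2])
    n1 = len(data1)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        d = ks_w(pooled[perm[:n1]], pooled[perm[n1:]],
                 pooled_w[perm[:n1]], pooled_w[perm[n1:]])
        if d >= d_obs:
            count += 1
    return count / n_perm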
– Luca Jokull
  • If you haven't done so already I would suggest asking this question on http://stats.stackexchange.com/ on account of its statistical content. (Both scipy and matlab seem to have what you need only in the unweighted forms.) – Bill Bell Oct 14 '16 at 15:55
  • @Bill Bell, thank you for your answer. On [stats.stackexchange.com](http://stats.stackexchange.com) we only found an old [post](http://stats.stackexchange.com/questions/193439/two-sample-kolmogorov-smirnov-test-with-weights) without answers. – Luca Jokull Oct 15 '16 at 14:33
  • Ah, well, nothing ventured nothing won. – Bill Bell Oct 15 '16 at 18:13

3 Answers


This solution is based on the code for scipy.stats.ks_2samp and runs in about 1/10000 the time (notebook):

import numpy as np

def ks_w2(data1, data2, wei1, wei2):
    # sort both samples and carry their weights along
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    # evaluate both weighted empirical CDFs at every pooled data point
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    return np.max(np.abs(cdf1we - cdf2we))
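
The speed comes from np.searchsorted, which evaluates each weighted step-function CDF at all pooled points in a single vectorized call. A toy illustration of the indexing trick (numbers made up):

x = np.array([1., 2., 4.])
np.searchsorted(x, np.array([.5, 2., 3., 5.]), side='right')
# array([0, 2, 2, 3]): the number of sample points <= each query value,
# which indexes directly into the zero-padded cumulative-weight arrays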

Here's a test of its accuracy and performance:

ds1 = np.random.rand(10000)
ds2 = np.random.randn(40000) + .2
we1 = np.random.rand(10000) + 1.
we2 = np.random.rand(40000) + 1.

ks_w2(ds1, ds2, we1, we2)
# 0.4210415232236593
ks_w(ds1, ds2, we1, we2)
# 0.4210415232236593

%timeit ks_w2(ds1, ds2, we1, we2)
# 100 loops, best of 3: 17.1 ms per loop
%timeit ks_w(ds1, ds2, we1, we2)
# 1 loop, best of 3: 3min 44s per loop
– Luca Jokull

To add to Luca Jokull's answer, if you want to also return a p-value (similar to the unweighted scipy.stats.ks_2samp function), the suggested ks_w2() function can be modified as follows:

import numpy as np
from scipy.stats import distributions

def ks_weighted(data1, data2, wei1, wei2, alternative='two-sided'):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    d = np.max(np.abs(cdf1we - cdf2we))
    # calculate p-value
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    m, n = sorted([float(n1), float(n2)], reverse=True)
    en = m * n / (m + n)
    if alternative == 'two-sided':
        prob = distributions.kstwo.sf(d, np.round(en))
    else:
        z = np.sqrt(en) * d
        # Use Hodges' suggested approximation Eqn 5.3
        # Requires m to be the larger of (n1, n2)
        expt = -2 * z**2 - 2 * z * (m + 2*n)/np.sqrt(m*n*(m+n))/3.0
        prob = np.exp(expt)
    return d, prob

This is the asymptotic method that scipy's original unweighted function uses.
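
As a quick consistency check (synthetic data, purely illustrative), uniform weights should reproduce scipy's own result, since the p-value code above mirrors the ks_2samp internals:

from scipy.stats import ks_2samp
a = np.random.randn(1000)
b = np.random.randn(1200) + .1
ks_weighted(a, b, np.ones(len(a)), np.ones(len(b)))
# compare with ks_2samp(a, b): the statistics match exactly, and the
# p-values match when ks_2samp is forced to its asymptotic mode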

– Tim Williams
  • Thanks for this. According to my tests, the two-sided p-values for equal weightings agree, as they should, with the original unweighted `scipy.stats.kstest`. – GCru Jun 03 '21 at 15:30

This is an R version of the two-tailed weighted KS statistic, following the suggestion in Monahan, Numerical Methods of Statistics (p. 334 in the 1st edition, p. 358 in the 2nd edition).

library(spatstat)  # provides ewcdf(); in recent releases it is exported by spatstat.geom

ks_weighted <- function(vector_1, vector_2, weights_1, weights_2){
    # weighted empirical CDFs; with normalise=FALSE the weights in each
    # sample are assumed to already sum to 1
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1,vector_2) 
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    ## P-VALUE with NORMAL SAMPLE 
    # n_vector_1 <- length(vector_1)                                                           
    # n_vector_2<- length(vector_2)        
    # n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)


    pkstwo <- function(x, tol = 1e-06) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}
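
A minimal usage sketch (made-up data; note that, given normalise=FALSE above, the weights in each sample should sum to 1):

set.seed(1)
x <- rnorm(100)
y <- rnorm(120, mean = 0.2)
wx <- rep(1/100, 100)
wy <- rep(1/120, 120)
ks_weighted(x, y, wx, wy)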