
Let's say that we have two samples, data1 and data2, with their respective weights wei1 and wei2, and that we want to compute the Kolmogorov-Smirnov statistic between the two weighted samples.

We do that in Python as follows:

import numpy as np

def ks_w(data1, data2, wei1, wei2):
    # sort both samples and carry their weights along
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data1 = data1[ix1]
    data2 = data2[ix2]
    d = 0.
    fn1 = 0.
    fn2 = 0.
    j1 = 0
    j2 = 0
    j1w = 0.
    j2w = 0.
    # walk through both sorted samples at once, advancing the weighted
    # empirical CDFs instead of the plain counts of the classical routine
    while j1 < len(data1) and j2 < len(data2):
        d1 = data1[j1]
        d2 = data2[j2]
        w1 = wei1[j1]
        w2 = wei2[j2]
        if d1 <= d2:
            j1 += 1
            j1w += w1
            fn1 = j1w / sum(wei1)
        if d2 <= d1:
            j2 += 1
            j2w += w2
            fn2 = j2w / sum(wei2)
        if abs(fn2 - fn1) > d:
            d = abs(fn2 - fn1)
    return d
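
For instance, a quick call on synthetic data (the arrays below are made up purely for illustration):

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(loc=.3, size=120)
wa = rng.random(100) + 1.
wb = rng.random(120) + 1.
ks_w(a, b, wa, wb)  # weighted KS distance, a value in [0, 1]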

where we have simply adapted to our purpose the classical two-sample KS statistic as implemented in Press, Flannery, Teukolsky, Vetterling, Numerical Recipes in C, Cambridge University Press, 1992, p. 626.

Our questions are:

  • is anybody aware of any other way to do it?
  • is there any library in Python/R/* that performs it?
  • what about the test itself? Does a weighted version exist, or should we use a reshuffling procedure to evaluate the statistic? (A minimal sketch of what we mean by reshuffling follows below.)
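
By a reshuffling procedure we mean something along the following lines; this is only a minimal illustrative sketch reusing ks_w above (the helper name, the number of permutations, and the fixed seed are arbitrary):

def ks_w_perm_pvalue(data1, data2, wei1, wei2, n_perm=1000, seed=0):
    # permutation p-value: pool the observations (with their weights),
    # repeatedly split the pool at random into groups of the original
    # sizes, and count how often the permuted statistic reaches d_obs
    rng = np.random.default_rng(seed)
    d_obs = ks_w(data1, data2, wei1, wei2)
    pooled = np.concatenate([data1, data2])
    pooled_w = np.concatenate([wei1, wei2])
    n1 = len(data1)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        d = ks_w(pooled[perm[:n1]], pooled[perm[n1:]],
                 pooled_w[perm[:n1]], pooled_w[perm[n1:]])
        if d >= d_obs:
            count += 1
    return count / n_perm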
– Luca Jokull
  • If you haven't done so already I would suggest asking this question on http://stats.stackexchange.com/ on account of its statistical content. (Both scipy and matlab seem to have what you need only in the unweighted forms.) – Bill Bell Oct 14 '16 at 15:55
  • @Bill Bell, thank you for your answer. On [stats.stackexchange.com](http://stats.stackexchange.com) we only found an old [post](http://stats.stackexchange.com/questions/193439/two-sample-kolmogorov-smirnov-test-with-weights) without answers. – Luca Jokull Oct 15 '16 at 14:33
  • Ah, well, nothing ventured nothing won. – Bill Bell Oct 15 '16 at 18:13

3 Answers


This solution is based on the code for scipy.stats.ks_2samp and runs in about 1/10000 the time (notebook):

import numpy as np

def ks_w2(data1, data2, wei1, wei2):
    # sort both samples and carry their weights along
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    # evaluate both weighted empirical CDFs at every pooled data point
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    return np.max(np.abs(cdf1we - cdf2we))
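
The speed comes from np.searchsorted, which evaluates each weighted step-function CDF at all pooled points in a single vectorized call. A toy illustration of the indexing trick (numbers made up):

x = np.array([1., 2., 4.])
np.searchsorted(x, np.array([.5, 2., 3., 5.]), side='right')
# array([0, 2, 2, 3]): the number of sample points <= each query value,
# which indexes directly into the zero-padded cumulative-weight arrays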

Here's a test of its accuracy and performance:

ds1 = np.random.rand(10000)
ds2 = np.random.randn(40000) + .2
we1 = np.random.rand(10000) + 1.
we2 = np.random.rand(40000) + 1.

ks_w2(ds1, ds2, we1, we2)
# 0.4210415232236593
ks_w(ds1, ds2, we1, we2)
# 0.4210415232236593

%timeit ks_w2(ds1, ds2, we1, we2)
# 100 loops, best of 3: 17.1 ms per loop
%timeit ks_w(ds1, ds2, we1, we2)
# 1 loop, best of 3: 3min 44s per loop
– Luca Jokull

To add to Luca Jokull's answer, if you want to also return a p-value (similar to the unweighted scipy.stats.ks_2samp function), the suggested ks_w2() function can be modified as follows:

import numpy as np
from scipy.stats import distributions

def ks_weighted(data1, data2, wei1, wei2, alternative='two-sided'):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    d = np.max(np.abs(cdf1we - cdf2we))
    # calculate p-value
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    m, n = sorted([float(n1), float(n2)], reverse=True)
    en = m * n / (m + n)
    if alternative == 'two-sided':
        prob = distributions.kstwo.sf(d, np.round(en))
    else:
        z = np.sqrt(en) * d
        # Use Hodges' suggested approximation Eqn 5.3
        # Requires m to be the larger of (n1, n2)
        expt = -2 * z**2 - 2 * z * (m + 2*n)/np.sqrt(m*n*(m+n))/3.0
        prob = np.exp(expt)
    return d, prob

This is the asymptotic method that scipy's original unweighted function uses.
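
As a quick consistency check (synthetic data, purely illustrative), uniform weights should reproduce scipy's own result, since the p-value code above mirrors the ks_2samp internals:

from scipy.stats import ks_2samp
a = np.random.randn(1000)
b = np.random.randn(1200) + .1
ks_weighted(a, b, np.ones(len(a)), np.ones(len(b)))
# compare with ks_2samp(a, b): the statistics match exactly, and the
# p-values match when ks_2samp is forced to its asymptotic mode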

– Tim Williams
  • Thanks for this. According to my tests, the two-sided p-values for equal weightings agree, as they should, with the original unweighted `scipy.stats.kstest`. – GCru Jun 03 '21 at 15:30

This is an R version of the two-tailed weighted KS statistic, following the suggestion in Monahan, Numerical Methods of Statistics (p. 334 in the 1st edition, p. 358 in the 2nd edition).

library(spatstat)  # provides ewcdf(); in recent releases it is exported by spatstat.geom

ks_weighted <- function(vector_1, vector_2, weights_1, weights_2){
    # weighted empirical CDFs; with normalise=FALSE the weights in each
    # sample are assumed to already sum to 1
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1,vector_2) 
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    ## P-VALUE with NORMAL SAMPLE 
    # n_vector_1 <- length(vector_1)                                                           
    # n_vector_2<- length(vector_2)        
    # n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)


    pkstwo <- function(x, tol = 1e-06) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}
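
A minimal usage sketch (made-up data; note that, given normalise=FALSE above, the weights in each sample should sum to 1):

set.seed(1)
x <- rnorm(100)
y <- rnorm(120, mean = 0.2)
wx <- rep(1/100, 100)
wy <- rep(1/120, 120)
ks_weighted(x, y, wx, wy)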