Kolomogorov-Smirnov test: C to R translation issue

Question

I am having difficulty translating an algorithm from C to R. It's about Kolmogorov Smirnov test, and more specifically the KS probability function

In 'Numerical Recipes in C', 'probks', it's coded as

#include <math.h>
#define EPS1 0.001
#define EPS2 1.0e-8
float probks(float alam)
/*Kolmogorov-Smirnov probability function.*/
{
   int j;
   float a2,fac=2.0,sum=0.0,term,termbf=0.0;

   a2 = -2.0*alam*alam;
   for (j=1;j<=100;j++) {
      term=fac*exp(a2*j*j);
      sum += term;
      if (fabs(term) <= EPS1*termbf || fabs(term) <= EPS2*sum) return sum;
      fac = -fac; /*Alternating signs in sum.*/
      termbf=fabs(term);
   }
   return 1.0; /* Get here only by failing to converge. */
}

I don't know how to handle the translation in R of the few last lines, all I have nowe is

PROBKS <- function(lambda) {

  EPS1 <- 0.001; EPS2 <- 1.0e-8;
  sum <- 0.0; fac <- 2.0; termbf <- 0.0; 
  a2 <- -2*lambda*lambda 

  for (j in 1:100) {
    term <- fac * exp(a2*j*j)
    sum <- sum + term
    if ( (abs(term) <= EPS1*termbf) || (abs(term) <= EPS2*sum) ) {
      break
    } else {
      fac <- -fac
    }
  }
  termbf <- abs(term)
  return(sum)
}

but this produces a non-monotonic probability function

where it should be $Q_KS(0) = 1$ and $Q_KS(\infty) = 0$. Obviously, it's about how to interpret/encode the last 'if' statement.

Any help would be very appreciated. M

EDIT 1: Here my session info

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reshape2_1.4.3  forcats_0.3.0   stringr_1.3.1   dplyr_0.7.7    
 [5] purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2   
 [9] ggplot2_3.1.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] withr_2.1.2      rvest_0.3.2      tidyselect_0.2.5 lattice_0.20-35 
 [5] pkgconfig_2.0.2  xml2_1.2.0       compiler_3.4.4   readxl_1.1.0    
 [9] Rcpp_0.12.19     cli_1.0.1        plyr_1.8.4       cellranger_1.1.0
[13] httr_1.3.1       tools_3.4.4      nlme_3.1-131.1   broom_0.5.0     
[17] R6_2.3.0         bindrcpp_0.2.2   bindr_0.1.1      scales_1.0.0    
[21] assertthat_0.2.0 gtable_0.2.0     stringi_1.1.7    rstudioapi_0.8  
[25] backports_1.1.2  hms_0.4.2        munsell_0.5.0    grid_3.4.4      
[29] colorspace_1.3-2 glue_1.3.0       lubridate_1.7.4  rlang_0.3.0.1   
[33] magrittr_1.5     lazyeval_0.2.1   yaml_2.2.0       crayon_1.3.4    
[37] haven_1.1.2      modelr_0.1.2     pillar_1.3.0     jsonlite_1.5

EDIT 2 Using Konrad's function ks_cdf and

x = seq(0, 1, by = 0.01)
plot(x, ks_cdf(x))

still gives 0 at 0

EDIT 3 After upgrading to 3.6.1

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
...

I still get the same plot as above, i.e. ks_cdf(0)=0 while it should be ks_sdf(0)=1

Just to clarify — (why) can’t you use the builtin function `ks.test`? Apart from that the error is a simple typo: the `termfb` update needs to be *inside* the loop, not outside (where it’s useless), and you changed the return value and, in doing so, aren’t respecting the lack of convergence. — Konrad Rudolph, Nov 12 '19 at 10:53
Btw, I am aware of the build in function. I need that for the 2D version of the KS test, which although is supported by an R package it comes without the significance calculation. — mjs, Nov 12 '19 at 11:36

Konrad Rudolph · Accepted Answer · 2019-11-12T16:47:25.187

5

The code can be translated into R almost literally — it’s not clear why you diverged from the C code without reason. Here’s a literal, slightly cleaned up translation:

ks_cdf = function (lambda) {
  EPS1 = 0.001
  EPS2 = 1.0e-8
  sum = 0
  fac = 2
  termbf = 0
  a2 = -2 * lambda ^ 2

  for (j in 1 : 100) {
    term = fac * exp(a2 * j ^ 2)
    sum = sum + term
    if ((abs(term) <= EPS1 * termbf) || (abs(term) <= EPS2 * sum)) {
      return(sum)
    } else {
      fac = -fac
      termbf = abs(term)
    }
  }
  1 # Failed to converge.
}

This code works but isn’t vectorised, which is something I’d change for a real implementation (but, by doing so, we’d lose the early exit).

Here’s an idiomatic R implementation using vectorised arithmetic and matrix multiplication:

ks_cdf = function (λ) {
  eps1 = 0.001
  eps2 = 1E-8

  range = seq(1, 100)
  terms = (-1) ^ (range - 1) * exp(-2 * range ^ 2 %*% t(λ ^ 2))
  sums = 2 * colSums(terms)
  pterms = abs(terms)
  prev_pterms = rbind(0, pterms[-nrow(pterms), , drop = FALSE])
  converged = apply(pterms <= eps1 * prev_pterms | pterms <= eps2 * sums, 2L, any)
  sums[! converged] = 1
  sums
}

And to show how nicely it vectorises, and that this is in fact a big deal:

x = seq(0, 1, by = 0.01)
plot(x, ks_cdf(x))

edited Nov 12 '19 at 16:47

answered Nov 12 '19 at 11:01

Konrad Rudolph

530,221
131
937
1,214

1

Thanks, I didn't know I can do 'return(sum)' within the loop! Yes, the vectorization is gone this way but it's not an issue, apart from not being able to plot the cdf in a simple way. Btw, the KS formula describes the cumulative distribution function (CDF) not the probability distribution function (PDF). – mjs Nov 12 '19 at 11:11
2

@mjs Regarding vectorisation, I disagree, in R you *should* regard this as an issue. See my updated answer for a vectorised function. Ideally one would get rid of the `apply` statements but I can’t think of a good way now. It can be replaced by a `colSums` but I refuse to write code that treats logical values as numerics, nor does it make the code much shorter. – Konrad Rudolph Nov 12 '19 at 11:28
Wow! Such code freaks me out, because I can never get my head around it, but it's really cool. Obviously I am not a R 'native speaker' :( – mjs Nov 12 '19 at 11:32
There seems to be an error in the vectorized version, ks_cdf(0)=0 whiole it should be ks_cdf(0)=1! – mjs Nov 12 '19 at 11:51
@mjs When I copy and paste the definition from my answer and run`ks_cdf(0)`, I get `1`, as expected. – Konrad Rudolph Nov 12 '19 at 12:07
I still get 0. The only difference I made in your code, I replaced '\lambda' with 'lambda'. Here how I use it for plotting: x <- seq(from = 0,to = 1,by = 1E-2); plot(x, ks_cdf(x),'b') - this shows a curve starting at 0 (as in the figure in the question). – mjs Nov 12 '19 at 12:20
@mjs What version of R are you using? Check my updated answer — this is the plot I’m getting for the (corrected) `plot` invocation under R 3.6.1. – Konrad Rudolph Nov 12 '19 at 12:41
`@Konrad` I added the session info and plotting results in my original post, under 3.4.4 I still get ks_cdf(0)=0 – mjs Nov 12 '19 at 13:08
`@Konrad Rudolph` same issues with R version 3.6.1 (2019-07-05), any idea what is going on? – mjs Nov 12 '19 at 14:22
@mjs That’s astonishing. The only cause that I can think of is that the difference might be because you’re running 32 bit, but even that seems unlikely (R numerics should be using the same double precision). Can you verify that `.Machine$double.digits` is 53? In general, the output from `.Machine` should be this: https://pastebin.com/34Phjphh. Alternatively, there might be a bug in some part of the Windows implementation but diagnosing this is going to be pretty hard. – Konrad Rudolph Nov 12 '19 at 14:24
`@Konrad Rudolph` I've tested all configurations 32/64bit and 3.4.4/3.6.1 - the error, ks_cdf(0)=0, is reproducible. .Machine$double.digits is 53 BUT .Machine shows three differences: $sizeof.long = 4; $sizeof.longdouble =12; $sizeof.pointer = 4 – mjs Nov 12 '19 at 14:40
@mjs Those shouldn’t matter, and they’re expected to be different on 32 bit. And since the issue also persists on 64 bit I’ll point fingers at Windows here. You could try debugging this and check which of the subexpressions in the `converged` calculations yields `TRUE` — they should all be `FALSE`. – Konrad Rudolph Nov 12 '19 at 14:48
@mjs Alright, please try the updated answer. I noticed a bug in the calculation. The bug doesn’t explain why the results differ for you and me, but it was nevertheless a genuine bug, and since it changes the implementation substantially maybe it fixes the issue. – Konrad Rudolph Nov 12 '19 at 15:25
`@Konrad Rudolph` we are almost there. It works for the x array, but single value evaluation, e.g. 'ks_cdf(0)' results in the following error message: 'Error in pterms <= eps1 * prev_pterms : non-conformable arrays' – mjs Nov 12 '19 at 16:41
@mjs LOL, sometimes I hate R: this error shouldn’t be allowed to occur. Thanks, fixed. – Konrad Rudolph Nov 12 '19 at 16:47
`@Konrad Rudolph` and the result is `ks_cdf(0) = 1`. Thanks so much for your help! BTW, is there still something to worry about because of the different results I got in Windows? – mjs Nov 12 '19 at 16:52
1

@mjs Not for this piece of code, I think. But in general that’s a whole different can of worms. I believe it’s due to the `lag` function, which I tried using here. Apart from the fact that this was wrong (we need the exact opposite, `lead`, which doesn’t exist in base R but does e.g. in dplyr), it’s usually used for time series and potentially does something weird with matrices. – Konrad Rudolph Nov 12 '19 at 16:56

Kolomogorov-Smirnov test: C to R translation issue

1 Answers1