
This article describes an analytical approximation of the normal CDF:

$$\Phi(x) \approx \frac{1}{\exp\left(-\frac{358}{23}x + 111\arctan\left(\frac{37x}{294}\right)\right) + 1}$$

The approximation uses the arctangent function, which is itself evaluated numerically. I found some discussions of how arctan is computed in general, and the algorithms seem pretty convoluted. In comparison, the source code of pnorm() in R seems pretty straightforward, though it may not be as efficient.

Is there any computational advantage to using atan() instead of pnorm() in R, especially with large data and a large parameter space, when a bunch of other numerical calculations based on the normal PDF are already being done?

Thanks!

seamux

1 Answer


I tried this out of curiosity.

First, define the function:

PNORM <- function(x) { 1/(exp(-358/23*x + 111*atan(37*x/294)) + 1) }

Then let us look at the differences over the range [-4, 4]:

x <- seq(-4, 4, .01)
plot(x, pnorm(x)-PNORM(x), type="l", lwd=3, ylab="Difference")

which results in this graph:

[Figure: the difference pnorm(x) - PNORM(x) plotted against x over [-4, 4]]
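To complement the plot with a number, the largest absolute deviation on the same grid can be computed directly. This is only a minimal sketch; the names `err`, `max_abs_err` and `x_at_max` are made up here for illustration and do not come from the original answer.

```r
# Same approximation as PNORM in the answer, repeated so the snippet is self-contained
PNORM <- function(x) 1 / (exp(-358/23 * x + 111 * atan(37 * x / 294)) + 1)

x   <- seq(-4, 4, 0.01)
err <- pnorm(x) - PNORM(x)          # signed difference, as in the plot

max_abs_err <- max(abs(err))        # largest absolute deviation on the grid
x_at_max    <- x[which.max(abs(err))]  # grid point where it occurs

print(c(max_abs_err = max_abs_err, x_at_max = x_at_max))
```

Whether the resulting error is acceptable depends on the application; the grid and range here simply mirror the ones used for the plot.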

So the difference is small, but maybe not small enough to ignore in some applications. YMMV. As for computing time, the two are roughly comparable, with the approximation appearing to be slightly faster (median of about 25 vs. 35 microseconds on this vector):

> microbenchmark::microbenchmark(pnorm(x), PNORM(x))
Unit: microseconds
     expr    min      lq     mean  median      uq    max neval cld
 pnorm(x) 34.703 34.8785 36.54254 35.1820 38.3150 47.786   100   b
 PNORM(x) 24.293 24.4625 27.07660 24.8875 28.9035 59.216   100  a 
ekstroem
  • I got pretty similar results. While the computational advantage is evident for the function alone, I was also interested in how each function performs inside other numerical methods, so I compared integrating `pnorm` and `PNORM` from `-Inf` to each element of a vector of 10000 random numbers, repeated 100 times. The result is a whopping 2.5% performance increase :( with an average difference of around 8e-5. I'm running an algorithm that takes hours per iteration, so that's not a significant time saving, at least not worth the difference when it's for percentage response variables. – seamux Aug 12 '17 at 05:30
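The comparison described in the comment could look roughly like the sketch below. This is a guess at the setup, not the commenter's actual code: the use of `integrate()`, the helper `int_to`, the seed, and the smaller sample size (200 upper limits instead of 10000, to keep it quick) are all assumptions.

```r
# Same approximation as PNORM in the answer
PNORM <- function(x) 1 / (exp(-358/23 * x + 111 * atan(37 * x / 294)) + 1)

set.seed(1)        # hypothetical seed, for reproducibility only
u <- rnorm(200)    # upper integration limits (the comment used 10000)

# Hypothetical helper: integrate f from -Inf to each upper limit
int_to <- function(f, upper) {
  vapply(upper, function(b) integrate(f, -Inf, b)$value, numeric(1))
}

# Time both versions and keep the integral values
t_exact  <- system.time(i_exact  <- int_to(pnorm, u))
t_approx <- system.time(i_approx <- int_to(PNORM, u))

mean_abs_diff <- mean(abs(i_exact - i_approx))  # average difference of the integrals
print(rbind(exact = t_exact[1:3], approx = t_approx[1:3]))
print(mean_abs_diff)
```

In a setup like this most of the time is likely spent inside the quadrature routine rather than in evaluating the integrand, which would be consistent with the small speed-up reported in the comment.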