
tldr: I am numerically estimating a PDF from simulated data and I need the density to decrease monotonically outside of the 'main' density region (as x -> infinity). What I have yields a density that is close to zero far out, but it does not decrease monotonically.


Detailed Problem

I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability density function of some random variable (whose distribution cannot be derived analytically) at some (observed) value x. The goal is to maximize the log-likelihood built from these densities, which requires the likelihood to have no spurious local maxima.
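
For concreteness, the kind of objective I am maximizing looks roughly like this; just a sketch, where simulate_stat(), observed_x and theta are placeholders for my actual simulator, observed data and parameter vector:

# sketch of the simulated maximum likelihood objective (placeholder names)
sml_loglik <- function(theta, observed_x, n_sim = 500) {
  sims <- simulate_stat(theta, n_sim)   # simulate the statistic under theta
  pdf_hat <- approxfun(density(sims))   # numerical approximation of its PDF
  sum(log(pdf_hat(observed_x)))         # log-likelihood of the observed values
}
# maximized with, e.g., optim(theta_start, sml_loglik, observed_x = observed_x,
#                             control = list(fnscale = -1))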

Since I do not have an analytic likelihood function, I numerically simulate the random variable by drawing the random component from some known distribution function and applying some non-linear transformation to it. I save the results of this simulation in a vector named Simulated_stats_.

I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:

# some example simulation
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
# approximate the PDF and evaluate it at x
approxfun(density(Simulated_stats_))(x)

This works well within the range of Simulated_stats_, see image: Example PDF. The problem is that I need to be able to evaluate the PDF far outside the range of the simulated data.

So in the image above, I would need to evaluate the PDF at, say, x=50:

approxfun(density(Simulated_stats_))(50)
[1] NA

So instead I use the from and to arguments of density(), which correctly approximates the near-zero tails, like so:

approxfun(
  density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
)(50)
[1] 1.924343e-18

This is great, under one condition: I need the density to go to zero the further x is from the range of the simulated data. That is, if I evaluate at x = 51 the result must be strictly smaller than at x = 50. (Otherwise my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic very far from the 'main' density mass, i.e. in the extrapolated region.)

To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted the result. It is discouraging: far from the main density mass the probability 'jumps' up and down. It stays very close to zero, but it is NOT monotonically decreasing.

a <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = function(x) {
  approxfun(
    density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
  )(x)
})
aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
plot(aa[, 1], log(aa[, 2]))

Result: Non-monotonic log density far from density mass

My question

Does this happen because of the kernel estimation in density(), or because of inaccuracies in approxfun()? (Or something else?)

What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?

Or: how can I manually change the approximated PDF so that it declines monotonically the further I am from the density mass? I would happily stick on some linear trend that goes to zero, roughly as sketched below...
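
To illustrate that last option, something along these lines is what I have in mind: a linear trend in the log-density beyond some cutoff (so the density itself decays exponentially and stays positive). This is only a rough sketch; the cutoff and decay rate are arbitrary choices I would still have to tune:

dens_hat <- approxfun(density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10))
cutoff <- max(Simulated_stats_)  # arbitrary threshold near the edge of the simulated range
decay  <- 1                      # arbitrary decay rate of the log-density beyond the cutoff

pdf_patched <- function(x) {
  edge <- dens_hat(cutoff)  # density value at the threshold
  # inside the cutoff use the kernel density estimate; beyond it decay the
  # log-density linearly, so the result is continuous at the cutoff and
  # strictly decreasing after it
  ifelse(x <= cutoff, dens_hat(x), edge * exp(-decay * (x - cutoff)))
}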

Thanks!

P_sam

1 Answer


One possibility is to estimate the CDF using a beta regression model; a numerical estimate of the derivative of that model can then be used to estimate the PDF at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.

  1. Import libraries
library(mgcv)
library(data.table)
library(ggplot2)
  2. Generate your data
set.seed(123)
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
  3. Function to estimate the CDF using a gam beta regression model
get_mod <- function(ss, p = seq(0.02, 0.98, 0.02)) {
  # empirical quantiles of the simulated statistic at probabilities p
  qp = quantile(ss, probs = p)
  # smooth beta regression of p on the quantiles, i.e. an estimate of the CDF
  betamod = mgcv::gam(p ~ s(qp, bs = "cs"), family = mgcv::betar())
  return(betamod)
}

betamod <- get_mod(Simulated_stats_)
  4. Very basic estimate of the PDF at val, given the model that estimates the CDF
est_pdf <- function(val, betamod, tol = 0.001) {
  # forward-difference approximation of the derivative of the estimated CDF at val
  xvals = c(val, val + tol)
  yvals = predict(betamod, newdata = data.frame(qp = xvals), type = "response")
  as.numeric((yvals[1] - yvals[2]) / (xvals[1] - xvals[2]))
}
  5. Let's check that it is monotonically increasing below the min of Simulated_stats_
test_x = seq(0,min(Simulated_stats_), length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummax(pdf))

[1] TRUE
  6. Let's check that it is monotonically decreasing above the max of Simulated_stats_
test_x = seq(max(Simulated_stats_), 60, length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummin(pdf))

[1] TRUE

Additional thoughts 3/5/22

As discussed in the comments, using betamod to predict might slow down the estimator. While this could be resolved to a great extent by writing your own predict function directly, there is another possible shortcut.
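
(For reference, "writing your own predict function" could look something like the sketch below, which uses mgcv's lpmatrix output to apply the coefficients and inverse link by hand; I haven't benchmarked whether that alone is a big enough speed-up.)

# sketch: bypass predict.gam(type = "response") by building the linear predictor by hand
fast_cdf <- function(betamod, xnew) {
  Xp  <- predict(betamod, newdata = data.frame(qp = xnew), type = "lpmatrix")
  eta <- as.vector(Xp %*% coef(betamod))   # linear predictor
  betamod$family$linkinv(eta)              # inverse link gives the estimated CDF
}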

  1. Generate estimates from the betamod over the range of X, including the extremes
k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
  2. Use the approach you were initially using, i.e. a linear interpolation, but rather than interpolating over the density() output, interpolate over k (i.e. over the above estimates from the beta model)
lin_int <- approxfun(x = seq(0, max(Simulated_stats_) * 10, length.out = 5000), y = k)
  3. You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x:
c(est_pdf(38,betamod), lin_int(38))
[1] 0.001245894 0.001245968

and it is very fast

microbenchmark::microbenchmark(
  list = alist("betamod" = est_pdf(38, betamod), "lin_int" = lin_int(38)), times = 100
)

Unit: microseconds
    expr    min      lq     mean  median      uq    max neval
 betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8   100
 lin_int    1.7    2.25    3.503    4.35    4.50   10.5   100

Finally, let's check the same plot you made before, but using lin_int() instead of approxfun(density(....))

a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))

performance of lin_int at extremes

langtang
  • Thanks! This is very helpful. The main problem is that this procedure takes quite a while, making it difficult to implement in the estimator. I ran a quick test using microbenchmark for the two methods, and the mean time using approxfun is 1/50th that of the beta regression. So I'm curious if anyone has other options. I will mark this as the answer if not. Thanks again! – P_sam Mar 03 '22 at 19:44
  • Yeah, the construction of the function and the prediction of each point will take longer with my approach... do you think there is any possibility of merging the approaches, using the slower prediction approach in the estimator only in the "low probability" area? – langtang Mar 03 '22 at 20:23
  • That's exactly what I'm trying to do right now! I'll determine some threshold beyond which I will use your approach. That does keep the combined function from being an actual pdf, but I think for the purposes of maximum likelihood it should be fine so long as there is no discontinuity at the threshold – P_sam Mar 03 '22 at 22:04
  • @P_sam How goes the progress? I'm probably spending too much time on this, but your question really sparked my interest. I was thinking a bit more today, and realized that a good way around the "slowness" of using the `betamod` in the estimator might be to use the linear interpolation approach directly on a large set of estimates from the betamod. The resulting function can be used in your estimator. See my addition to the solution – langtang Mar 06 '22 at 00:39
  • This is great! It's indeed lightning fast once you estimate the function lin_int! I think the only issue for me is that in my particular case I need to estimate betamod (and hence lin_int) each time, since I am recalculating the likelihood pdf for each candidate parameter vector, so it still takes a long time :\ But absolutely, if the goal is to approximate the pdf at multiple points this is a fantastic solution. – P_sam Mar 12 '22 at 00:37