Efficiently find empirical density() for many arbitrary sample values (like dnorm(), but for empirical distribution)

Question

Say you've defined an empirical density (sample.density) for a sample of x.sample as in the following:

set.seed(1)
x.sample <- rnorm(100)
sample.density <- density(x.sample)

Now say that we have a gradient, G, for which we would like to know the expected densities

G <- seq(-2,2, length.out=20)

Based on the empirical distribution sample.density, what is the density of each value in G?

If I use a for() loop, I can get the answers like this:

G.dens <- c()
for(i in 1:length(G)){
    t.G <- G[i]
    G.dens[i] <- sample.density$y[which.min(abs(sample.density$x-t.G))]
}

The overarching idea is to do something like dnorm(), but instead of assuming that x is normally distributed with specified mean and s.d., I'd like to use a distribution determined empirically from an arbitrary sample (that isn't necessarily normal or any other well-described distribution that would be in the stats package).

I think this is what you're after: http://stackoverflow.com/questions/24957905/retrieve-y-value-from-density-function-of-given-x-value/24958001#24958001 — MrFlick, May 21 '15 at 20:20
Or this: http://stackoverflow.com/questions/28077500/find-the-probability-density-of-a-new-data-point-using-density-function-in-r/28078265#28078265 — MrFlick, May 21 '15 at 20:21
Or this: http://stackoverflow.com/questions/22476542/how-to-gain-a-function-of-an-estimated-density/22477079#22477079. What did you search for that didn't find these? Am I missing something? — MrFlick, May 21 '15 at 20:22
Yes, many points at once! In your first comment linking to your answer, it's like finding the density for many blue dots. Instead of `approxfun` I am using a `which.min`, which isn't vectorized. I'll be finding the density of a *lot* of points, so I want this to be a vectorized operation. note that my for() loop answer does give the correct answer, 1 point at a time, as in your linked questions. — rbatt, May 21 '15 at 20:23
`approxfun()` does return a vectorized function: `ds<-with(sample.density, approxfun(x,y)); ds(seq(-2,2, length.out=20))`. Do you not want linear interpolation? You can use `method="constant"` to get something more like your implementation. — MrFlick, May 21 '15 at 20:25
Ah, thank you! That third link provides the approach I needed ... I hadn't tried using `approxfun` actually b/c I wasn't familiar with it. Because that approximation is vectorized, it avoids the trouble of which.min, to find the closest matching element, because which.min cannot return several indices. Thanks. I'll just delete this question, as I think it duplicates that last link. I'd been searching for "empirical density" a lot. @MrFlick — rbatt, May 21 '15 at 20:31
@MrFlick I emphasized efficiency in the question title now, and provided an answer comparing speeds of the 3 approaches that I used. The one you suggested was the fastest. I don't think any of the answers you linked compared how to do this quickly, so I figure this contributes in a new way. — rbatt, May 21 '15 at 21:09

score 1 · Accepted Answer · answered May 21 '15 at 21:08

I think the comment by @MrFlick pointed me in the right direction. In addition to the suggested approxfun approach and my example for loop approach, I also realized I could use mapply. Note that my use of approxfun won't exactly match the result by the 2 other approaches which use which.min, but I'm not concerned with that difference in output too much, although others might be.

First, reproducing the sample data from the question:
set.seed(1)
x.sample <- rnorm(100)
sample.density <- density(x.sample)
G <- seq(-2,2, length.out=20)

Now, create a function for the loop version

loop()

loop <- function(x, ref){
    if(class(ref)!="density"){
        ref <- density(ref)
    }

    ref.y <- ref$y
    ref.x <- ref$x

    G.dens <- c()
    for(i in 1:length(x)){
        t.G <- x[i]
        G.dens[i] <- ref.y[which.min(abs(ref.x-t.G))]
    }

    G.dens
}

Next, I'll use the approach I came up with that uses mapply

dsample()

dsample <- function(x, ref){

    if(class(ref)!="density"){
        ref <- density(ref)
    }

    XisY <- function(x,y){ # which of several X values most closely matches a single Y value?
        which.min(abs(y-x))
    }

    ref.y <- ref$y
    ref.x <- ref$x

    # ds <- approxfun(ref)

    # ds(x)

    ref.y[mapply(XisY, x, MoreArgs=list(y=ref.x))]
}

Finally, the approach harnessing approxfun as suggested by @MrFlick:

af()

af <- function(x, ref){
    if(class(ref)!="density"){
        ref <- density(ref)
    }

    # XisY <- function(x,y){ # which of several X values most closely matches a single Y value?
    #   which.min(abs(y-x))
    # }

    ref.y <- ref$y
    ref.x <- ref$x

    ds <- approxfun(ref)

    ds(x)

    # ref.y[mapply(XisY, x, MoreArgs=list(y=ref.x))]
}

Now to compare their speed:

microbenchmark(
    loop(G, sample.density),
    dsample(G, sample.density),
    af(G, sample.density)
)
# Unit: microseconds
#                        expr     min       lq     mean  median       uq      max neval
#     loop(G, sample.density) 221.801 286.6675 360.3698 348.065 409.9785  942.071   100
#  dsample(G, sample.density) 252.641 290.7900 359.0432 368.388 417.1510  520.960   100
#       af(G, sample.density) 201.331 227.8740 276.0425 253.339 273.6225 2545.081   100

Now compare speed as the size of G increases:

speed.loop <- c()
speed.dsample <- c()
speed.af <- c()
lengths <- seq(20, 5E3, by=200)
for(i in 1:length(lengths)){
    G <- seq(-2,2, length.out=lengths[i])
    bm <- microbenchmark(
        loop(G, sample.density),
        dsample(G, sample.density),
        af(G, sample.density), times=10
    )

    means <- aggregate(bm$time, by=list(bm$expr), FUN=mean)[,"x"]/1E6 # in milliseconds
    speed.loop[i] <- means[1]
    speed.dsample[i] <- means[2]
    speed.af[i] <- means[3]


}

speed.ylim <- range(c(speed.loop, speed.dsample, speed.af))
plot(lengths, (speed.loop), ylim=(speed.ylim), type="l", ylab="Time (milliseconds)", xlab="# Elements in G")
lines(lengths, (speed.dsample), col="red")
lines(lengths, (speed.af), col="blue")

enter image description here