Find the probability density of a new data point using "density" function in R

Question

I am trying to estimate the PDF of continuous data with an unknown distribution, using the `density` function in R. Given a new data point, I want to find the probability density at that point, based on the kernel density estimate returned by `density`. How can I do that?

@whuber; You might caution the questioner that a _density_ is not a probability. (As I was first reading your answer I thought you were going to say that the answer was trivial because the probability at any point was (trivially) zero.) — IRTFM, Jan 21 '15 at 22:12
Is this a discrete distribution? For continuous distributions, the probability of observing any specific value is 0. Not sure why this was migrated here. Seems like the hangup is still on statistical understanding, not programming. — MrFlick, Jan 21 '15 at 22:21
@whuber Are you saying the obvious theoretical answer is 0? How does that make this a programming question. — MrFlick, Jan 21 '15 at 22:22
@MrFlick Understanding that by "probability" the OP really means "probability density," the obvious theoretical answer is the value of the kernel density estimate at the point--not zero. The question is about computing a kernel density estimate in `R`: that's what makes it a programming question. It's not a completely trivial programming question, either (although it does have a pretty simple solution), because `R` returns its KDE as an array of equally spaced values, so something in addition is needed in order to obtain the value for an arbitrary argument. — whuber, Jan 21 '15 at 22:25
@whuber I don't read it that way at all. The OP says they've already created the kernel density estimate with `density()` which is exactly the right way to do that. — MrFlick, Jan 21 '15 at 22:28
@MrFlick What `R` returns by default does not answer the question about finding the KDE "at a new data point". (I have consulted the help page for `density` to confirm that.) Some programming is needed: either a way to get `R` to supply the KDE at an arbitrary argument or a way to interpolate (and maybe *extrapolate*) from the array returned by `density`. (The "programming" probably amounts to setting the optional `n`, `from`, and `to` arguments to suitable values.) — whuber, Jan 21 '15 at 22:31

Glen_b · Answer 1 · 2015-01-21T22:54:35.987

If your new point lies within the range of values produced by `density`, it's fairly easy to do: I'd suggest using `approx` (or `approxfun` if you need it as a function) to interpolate between the grid values.

Here's an example:

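set.seed(2937107)
x <- rnorm(10, 30, 3)             # 10 draws from a normal with mean 30, sd 3
dx <- density(x)                  # KDE evaluated on an equally spaced grid
xnew <- 32.137
approx(dx$x, dx$y, xout = xnew)   # the interpolated density is the $y component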

If we plot the density and the new point we can see it's doing what you need:

[Figure: plot of the kernel density estimate with the new point marked]
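If you need the estimate at several new points, the `approxfun` variant mentioned above may be more convenient. Here is a minimal sketch on the same toy data; `f_hat` is just a name chosen for this illustration:

```r
# Same toy data as in the example above.
set.seed(2937107)
x  <- rnorm(10, 30, 3)
dx <- density(x)

# approxfun() returns a function that linearly interpolates the grid.
# f_hat is a hypothetical name used only in this sketch.
f_hat <- approxfun(dx$x, dx$y)

f_hat(32.137)                          # estimate at a single new point
f_hat(c(min(x), median(x), max(x)))    # vectorised over several points
f_hat(100)                             # NA: 100 lies outside the grid
```

The interpolating function is built once and can then be reused like any other R function, for example inside `sapply` or `curve`.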

`approx` will return `NA` if the new value lies outside the grid, i.e. if it would need to be extrapolated. If you want to handle extrapolation, I'd suggest computing the KDE at that point directly (using the bandwidth from the KDE you already have).
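A rough sketch of that direct computation, assuming the default Gaussian kernel (for which the bandwidth `dx$bw` is the standard deviation of the kernel). The helper `kde_at` and the point `xfar` are hypothetical names introduced for this illustration:

```r
set.seed(2937107)
x  <- rnorm(10, 30, 3)
dx <- density(x)                      # default kernel is Gaussian

xfar <- max(dx$x) + 1                 # hypothetical point just outside the grid
approx(dx$x, dx$y, xout = xfar)$y     # NA: approx() does not extrapolate by default

# Direct evaluation: the average of Gaussian densities centred at the
# data points, each with standard deviation equal to the bandwidth.
kde_at <- function(t, data, bw) {
  sapply(t, function(ti) mean(dnorm(ti, mean = data, sd = bw)))
}

kde_at(xfar, x, dx$bw)                # small but well defined
kde_at(32.137, x, dx$bw)              # should be close to the interpolated value
```

The two approaches will not agree to the last digit inside the grid, because `density` computes its values with a binned approximation and `approx` then interpolates linearly between them.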

Antoine · Answer 2 · 2017-02-20T07:36:35.207


This is one year old, but nevertheless, here is a complete solution. Let's call

d <- density(xs)

and define `h = d$bw`. Your kernel density estimate is completely determined by

the elements of `xs`,
the bandwidth `h`,
the type of kernel function.
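In formula form, these three ingredients combine as follows. The symbols $n$ (the number of elements of `xs`), $x_i$ (the $i$-th element) and $K$ (the kernel) are notation introduced here for illustration; $y(t)$ denotes the estimate at a point $t$:

```latex
$$
y(t) \;=\; \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right)
$$
```

For the Gaussian kernel, $K$ is the standard normal density, which is what `dnorm` with `mean = 0` and `sd = 1` evaluates.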

Given a new value t, you can compute the corresponding y(t), using the following function, which assumes you have used Gaussian kernels for estimation.

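myKDE <- function(t){
    # xs (the data) and h (the bandwidth, d$bw) are looked up in the enclosing environment
    kernelValues <- numeric(length(xs))
    for(i in seq_along(xs)){
        transformed <- (t - xs[i]) / h
        # Gaussian kernel centred at xs[i], rescaled by the bandwidth
        kernelValues[i] <- dnorm(transformed, mean = 0, sd = 1) / h
    }
    return(sum(kernelValues) / length(xs))
}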

`myKDE` simply computes y(t) from the definition of the kernel density estimate.
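One way to convince yourself that this matches what `density` reports is to evaluate the definition on the grid that `density` returns and look at the largest gap. This is only a sketch: the sample, the seed and the name `kde_direct` are made up for illustration, and the vectorised form follows the one-liner suggested in the comments.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
xs <- rnorm(50, 30, 3)       # hypothetical sample
d  <- density(xs)            # Gaussian kernel by default
h  <- d$bw

# Vectorised evaluation of the definition at each value in t.
kde_direct <- function(t) {
  sapply(t, function(ti) mean(dnorm(ti, mean = xs, sd = h)))
}

# Largest absolute gap between density()'s grid values and direct evaluation.
max(abs(kde_direct(d$x) - d$y))
```

The gap should be small but not exactly zero, since `density` uses a binned approximation rather than summing the kernels one by one.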

edited Feb 20 '17 at 07:36

answered Jan 08 '16 at 17:02

Antoine


what is `centri`? – Maxwell Chandler Feb 20 '17 at 03:33
this is great, I would give you more upvotes if I could. – Maxwell Chandler Feb 20 '17 at 07:44
@MaxwellChandler No need for that:) (But ... You can create a million of user accounts and then each of them can upvote the question:)) – Antoine Feb 20 '17 at 09:19
Just a bit simpler & faster equivalent: `function(xs, t, h = bw.nrd0(xs)) mean(dnorm(t, mean = xs, sd = h))`. – F. Privé Nov 09 '21 at 16:36

score -4 · Answer 3 · answered Jan 21 '15 at 21:59

See: docs

dnorm(data_point, its_mean, its_stdev)

bill_e


I know I can use those norm functions, however, they need the mean and sd. The point is that the result of the pdf can differ using "density" function while using different bandwidths, for the same mean and sd! – programmingIsFun Jan 21 '15 at 22:12
@programming You ought to clarify this point in your question: by asserting your data "has Gaussian distribution" you imply that you know or can estimate its parameters. Currently the reader has to deduce that you really don't have a Gaussian, but are using a *kernel density estimate* from your data. You need to be explicit about this. – whuber Jan 21 '15 at 22:15
Ah, I see. Yeah I didn't assume you were using a kde from your data. I upvoted @Glen_b's answer. – bill_e Jan 22 '15 at 00:10
