Estimating probability density in a range between two x values on simulated data

Question

I want to evaluate the probability density of data I've simulated.

I know that if I simply want to find the probability density for a single x value on the normal distribution, I could use dnorm() in the following way:

dist_mean <- 10
dist_sd <- 0.2
prob_density_on_x_val <- dnorm(x = 9.9,
                               mean = dist_mean,
                               sd = dist_sd)

prob_density_on_x_val

[1] 1.760327

But what if I want to evaluate the probability density for a range between two x values in a simulated data?

dist_mean <- 10
dist_sd <- 0.2

## simulate 100,000 values from the normal distribution, 
## given specific mean and standard deviation.
set.seed(123)
random_vals <- rnorm(n = 100000,
                     mean = dist_mean,
                     sd = dist_sd)


hist(random_vals)

My 100,000 generated values are raw values, and they do take a normal shape. However, this is not a probability density function, as the area under the curve isn't equal to 1.

library("pracma")
trapz(random_vals)

random_vals
[1] 1000009

My questions:

Given my simulated data, how can I create a probability density function for it?
Once created, how could I estimate the: (1) probability under the curve, and (2) probability density on the curve, for a range between two x values? For example, the probability and probability density between x=9.7 and 10.2. Or any other range.

My attempt to figure this out:

In this comment, @Glen_b says using ecdf() is the way to calculate probability in a range between two x values "a" and "b": ecdf(b)-ecdf(a). However, something doesn't make sense, because:

cdf <- ecdf(random_vals)
range_density <- cdf(10.2)-cdf(9.7)

range_density
[1] 0.77358

How is it possible that the probability density on a point value (x=9.9) was 1.76, but for a range 9.7<x<10.2 it's smaller (0.77)? Both distributions (both the one defined with dnorm and the one simulated with rnorm) have the same mean and sd.

So I think I'm missing something fundamental, and would be grateful for any help. Overall, it seems like a very simple question, but I can't find a straightforward solution despite a lot of reading and digging.

Thanks!

Edit

The thing I was missing was the distinction between:

probability for a range of x values: the area under the curve of the pdf
probability density for a specific x value: the function's value for a given x value (this is what dnorm() is useful for)
probability density for a range along the pdf curve, between two x values (chosen answer + comments addresses that)

AkselA · Answer 1 · 2019-08-04T11:45:12.287

3

It doesn't make sense to calculate the probability for a single value in a continuous probability function, it is by definition zero, but you can calculate relative likelihoods. You react on random_vals not summing to one, but not that prob_density_on_x_val is more than one?

Glen is of course correct in that ecdf() is the way to go for a non-parametric estimate, but if you expect a normal distribution you can also do a parametric estimate.

dist_mean <- 10
dist_sd <- 0.2
a <- 9.7
b <- 10.2

set.seed(123)
r <- rnorm(1e4, dist_mean, dist_sd)

# population
pnorm(b, dist_mean, dist_sd) - pnorm(a, dist_mean, dist_sd)
# [1] 0.7745375

# parametric estimate
pnorm(b, mean(r), sd(r)) - pnorm(a, mean(r), sd(r))
# [1] 0.7753985

# nonparametric estimate
ecdfun <- ecdf(r)
ecdfun(b) - ecdfun(a)
# [1] 0.7754

edited Aug 04 '19 at 11:45

answered Aug 04 '19 at 11:33

AkselA

8,153
2
21
34

should that not read "continuous probability function" instead of "discrete" in your first sentence? – jludewig Aug 04 '19 at 11:43
@jludewig: Ah, yes – AkselA Aug 04 '19 at 11:44
@AkselA, thanks. True, I completely overlooked the p > 1 on a single value. However, how come `dnorm` did give me a value > 1, and not ≈ 0? – Emman Aug 04 '19 at 12:41
1

OK I think I get it. `dnorm` simply gives the value of the function for a given x, not the area under the curve for that x (which is basically nothing for a single value). – Emman Aug 04 '19 at 12:47

G5W · Accepted Answer · 2019-08-04T12:58:45.000

2

You can get the probability density function using the functions density and approxfun.

DensityFunction = approxfun(density(random_vals), rule=2)
DensityFunction(9.7)
[1] 0.6410087
plot(DensityFunction, xlim=c(9,11))

You can get the area under the curve using integrate

AreaUnderCurve = function(lower, upper) {
    integrate(DensityFunction, lower=lower, upper=upper) }

AreaUnderCurve(10,11)
0.5006116 with absolute error < 6.4e-05
AreaUnderCurve(9.5,10.5)
0.9882601 with absolute error < 0.00011

You also ask:

How is it possible that that the probability density on point value (x=9.9) was 1.76, but for a range 9.7

The value of the pdf (1.76) is the height of the curve. The value you are getting for the range is the area under the curve. Since the width of the interval is 0.5, it is not a surprise that the area under the curve is less than the height.

edited Aug 04 '19 at 12:58

answered Aug 04 '19 at 11:45

G5W

36,531
10
47
80

thanks to you I understand the subtlety between _density_ (height of curve for val x) and _probability_ (area under the curve for a range). However, and it might be out of the scope of this page -- can we also calculate the *density* for a range, rather than probability for a range (which is your answer)? i.e., can we measure the curve between two x values to get the range density? Does that even make sense? – Emman Aug 04 '19 at 13:11
1

You could get the _average_ density over a range by computing `AreaUndercurve(range) / width(range)` – G5W Aug 04 '19 at 13:20

Estimating probability density in a range between two x values on simulated data

My questions:

My attempt to figure this out:

Edit

2 Answers2