I want to evaluate the probability density of data I've simulated.
- I know that if I simply want to find the probability density for a single x value on the normal distribution, I could use
dnorm()
in the following way:
dist_mean <- 10
dist_sd <- 0.2
prob_density_on_x_val <- dnorm(x = 9.9,
mean = dist_mean,
sd = dist_sd)
prob_density_on_x_val
[1] 1.760327
- But what if I want to evaluate the probability density for a range between two x values in a simulated data?
dist_mean <- 10
dist_sd <- 0.2
## simulate 100,000 values from the normal distribution,
## given specific mean and standard deviation.
set.seed(123)
random_vals <- rnorm(n = 100000,
mean = dist_mean,
sd = dist_sd)
hist(random_vals)
- My 100,000 generated values are raw values, and they do take a normal shape. However, this is not a probability density function, as the area under the curve isn't equal to 1.
library("pracma")
trapz(random_vals)
random_vals
[1] 1000009
My questions:
- Given my simulated data, how can I create a probability density function for it?
- Once created, how could I estimate the: (1) probability under the curve, and (2) probability density on the curve, for a range between two x values? For example, the probability and probability density between x=9.7 and 10.2. Or any other range.
My attempt to figure this out:
In this comment, @Glen_b says using ecdf()
is the way to calculate probability in a range between two x values "a" and "b": ecdf(b)-ecdf(a)
. However, something doesn't make sense, because:
cdf <- ecdf(random_vals)
range_density <- cdf(10.2)-cdf(9.7)
range_density
[1] 0.77358
How is it possible that the probability density on a point value (x=9.9) was 1.76, but for a range 9.7<x<10.2
it's smaller (0.77)? Both distributions (both the one defined with dnorm
and the one simulated with rnorm
) have the same mean and sd.
So I think I'm missing something fundamental, and would be grateful for any help. Overall, it seems like a very simple question, but I can't find a straightforward solution despite a lot of reading and digging.
Thanks!
Edit
The thing I was missing was the distinction between:
- probability for a range of x values: the area under the curve of the pdf
- probability density for a specific x value: the function's value for a given x value (this is what
dnorm()
is useful for) - probability density for a range along the pdf curve, between two x values (chosen answer + comments addresses that)