
I was working with a dataset that consists of two different groups of observations where the value is an integer. I wanted to plot the density of these to get a sense for how the different groups are distributed over the values.

What happened was one group had a 'smooth' density while the other had a 'wavy' density. I know this has something to do with bandwidth and the fact that my data is basically tied to discrete observations but I would love if someone can explain exactly why.

Here's an example:

data2 <- rbind(
    data.frame(group=rep('poisson1', 1000), value = rpois(1000, 5)),
    data.frame(group=rep('poisson2', 1000), value = rpois(1000, 45)))

library(ggplot2)
ggplot(data2, aes(x=value, fill=group)) +
  geom_density()


And strangely, when I recreate that data frame to get a new sample, the plot is sometimes smooth.

pogibas
JasonAment
  • Look at the docs for `geom_density`. There are a number of arguments that get passed to the underlying `density` function, including the kernel type and bandwidth – camille Jun 06 '18 at 17:35
  • Yes, I did read the docs for geom_density, which led me to stats::density, and I'm sure it's related to the fact that the x values are discrete and to the chosen bandwidth. But I'm still not entirely clear on why, especially when I can see the same thing with a plot of a single sample of 1000 draws from rpois with lambda = 5. Most of the time the density plot is smooth, but every once in a while it's not. I'm hoping someone can explain why in a way I can understand. – JasonAment Jun 06 '18 at 19:09
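
For reference, the bandwidth that `stats::density` would pick for each group can be inspected directly. This is a minimal sketch, assuming `geom_density` is left at its default `bw = "nrd0"` and estimates the density separately for each group. The seed and the names `x1`, `x2`, `bw1`, `bw2`, `d1` are illustrative and not from the original post.

```r
set.seed(1)  # arbitrary seed, only so the numbers are reproducible

# Same two groups as in the question
x1 <- rpois(1000, 5)
x2 <- rpois(1000, 45)

# bw.nrd0() is the rule density() uses when bw = "nrd0" (its default)
bw1 <- bw.nrd0(x1)
bw2 <- bw.nrd0(x2)

# The bandwidth actually used is also stored on the density object
d1 <- density(x1)

print(c(poisson1 = bw1, poisson2 = bw2, from_density = d1$bw))
```

If the bandwidth for the lambda = 5 group comes out well below 1 (the spacing between adjacent integers), the kernel bumps around neighbouring integers overlap less, which is one plausible source of the waviness. To compare, a fixed value such as `geom_density(bw = 1)` or a larger `adjust` can be tried.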

1 Answer


The observed smoothness (or lack of it) is "caused" by the rpois() function. Its lambda argument is the non-negative mean of the desired distribution. When you pass a lambda closer to zero (as in rpois(1000, 5)), it generates fewer unique values, because the distribution is bounded below by zero.

Consider this example:

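library(ggplot2)

# Draw nValue Poisson samples for each lambda in nLambda
nValue <- 1e3
nLambda <- c(1:9, seq(10, 100, 10))

foo <- lapply(nLambda, function(lambda) {
    data.frame(value = rpois(nValue, lambda), lambda)
})
data <- do.call(rbind, foo)

# One density curve per lambda, colored by lambda
ggplot(data, aes(value, group = lambda, color = lambda)) +
    geom_density()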


We can see that lambdas close to zero produce curves with distinct peaks, while lambdas further from zero produce smoother lines.
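
The claim about fewer unique values can also be checked numerically. This is a small sketch on the same lambda grid; the seed and the name `nUnique` are illustrative assumptions, and the exact counts will differ from sample to sample.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

nValue <- 1e3
nLambda <- c(1:9, seq(10, 100, 10))

# Number of distinct integers observed in one sample per lambda
nUnique <- vapply(nLambda, function(lambda) {
    length(unique(rpois(nValue, lambda)))
}, integer(1))

print(data.frame(lambda = nLambda, nUnique = nUnique))
```

A small lambda should show only a handful of distinct values, so the density estimate is built from a few heavily repeated points.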

You can also test this by looking at the variance within each lambda group:

ggplot(aggregate(data$value, list(data$lambda), var), aes(Group.1, x)) +
    geom_line() +
    geom_point() +
    labs(x = "Lambda",
         y = "Variance")


pogibas
  • That makes some sense, but I can plot a sample of 1000 draws from a single rpois with lambda = 5 several times, and most of the time the density plot is smooth. Every once in a while I'll see one that's 'wavy.' I guess this is just surprising to me. – JasonAment Jun 06 '18 at 19:02
  • @JasonAment it's just a probability: if you sample 5 numbers with mean 5, then you can get `3,4,5,6,7` (smooth) or `5,5,5,5,5` (one peak). – pogibas Jun 06 '18 at 19:06
  • I understand there will be variation in a sample, but I'm still surprised that repeatedly sampling from rpois with n = 1000 and plotting the density of that sample usually produces a 'smooth', but always slightly varying density plot, but every once in a while produces a 'wavy' one. Beyond simple sampling variability, is there something special about how the bandwidth is being chosen and the fact that the x values are discrete that would explain this? – JasonAment Jun 07 '18 at 14:30
  • I doubt that it has anything to do with bandwidth. If you look at the raw values in R, you should observe some kind of low-uniqueness pattern. – pogibas Jun 07 '18 at 14:32
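
The last two comments disagree on whether bandwidth plays a role, and this can be checked empirically. The sketch below assumes the default `bw = "nrd0"` rule, which `bw.nrd0()` computes as 0.9 times the smaller of the standard deviation and IQR/1.34, times n^(-1/5). It redraws the lambda = 5 sample many times and records the number of unique values, the IQR, and the resulting bandwidth. The seed, `nRep`, and `res` are illustrative names, not from the thread.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

nRep <- 200  # number of repeated samples (illustrative choice)

# One row per repeated sample of 1000 draws from Poisson(5)
res <- as.data.frame(t(replicate(nRep, {
    x <- rpois(1000, 5)
    c(nUnique = length(unique(x)), iqr = IQR(x), bw = bw.nrd0(x))
})))

# How often each sample IQR occurs across the repeats
print(table(res$iqr))

# Mean default bandwidth for each observed IQR value
print(aggregate(bw ~ iqr, data = res, FUN = mean))

# Spread of the number of unique values across the repeats
print(summary(res$nUnique))
```

If the bandwidth clusters at a few distinct levels that track the sample IQR, while the number of unique values barely changes, that would suggest the occasional wavy plot comes from the bandwidth rule reacting to discrete data rather than from low uniqueness alone.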