Interpretation of 2D density estimate charts

Question

I created the following charts in R using stat_density_2d() (left) and geom_density2d_filled() (right) respectively. Despite both charts looking visually identical, the levels are significantly different.

How can these values be contextualized or interpreted? For example, does the yellow area in the right chart cover 25% of the observations in the densest area and the cyan area cover 50% respectively. What is the relationship between these different levels?

library(ggplot2)

set.seed(123)
dat <-
  data.frame(
    X = c(rnorm(300, 3, 2.5), rnorm(150, 7, 2)),
    Y = c(rnorm(300, 6, 2.5), rnorm(150, 2, 2)))

ggplot(dat, aes(X, Y)) +
  stat_density_2d(geom = "polygon",
                  aes(fill = after_stat(level)), bins = 4) +
  geom_point(alpha = 0.1)

ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "ndensity",
    breaks = seq(0.25, 1, length.out = 4)
  ) +
  geom_point(alpha = 0.1)

# EDIT to incorporate chart based on comment
ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    bins = 4) +
  geom_point(alpha = 0.1)

Try `contour_var = "density"` rather than `contour_var = "ndensity"` See the `? geom_density_2d_filled` help page for a discussion of the differences. — MrFlick, Feb 28 '23 at 23:14
@MrFlick Indeed, you are right. The more natural comparison would be using `density` instead of `ndensity`. However, by using `density` with 4 bins, there is still an observable difference in levels between the two charts. Thus, I wonder why this is the case and if the `ndensity` case can be interpreted as mentioned above. I edited the corresponding code to incorporate this. — carl, Mar 01 '23 at 07:55
this has been discussed in extenso here https://stackoverflow.com/questions/19329318/how-to-correctly-interpret-ggplots-stat-density2d and https://stackoverflow.com/questions/61817440/follow-up-to-stat-contour-2d-bins-interpretation?noredirect=1&lq=1. I would close as a duplicate, but given the bounty, this is not possible — tjebo, Mar 03 '23 at 12:23
@tjebo I thought I would have a go at answering this by showing how to create contour lines that represent quantiles of the data. I don't think any of the linked posts have done that, but let me know if I am interpreting this wrong. — Allan Cameron, Mar 03 '23 at 16:17
@AllanCameron would be interesting how this compares to the probability density which i believe is given with stat_density2d - or the function suggested in https://stackoverflow.com/a/59173290/7941188. I would strongly expect a very similar outline... — tjebo, Mar 03 '23 at 16:36
@tjebo the linked answer does give a similar outline. Difficult to tell if it's identical, though it outputs a data frame of a single contour rather than allowing quantiles to be correctly specified for a multi-contoured plot. `stat_density2d` doesn't seem to match at all; I think it gives proportions of maximal density rather than proportions of points within each contour. — Allan Cameron, Mar 03 '23 at 16:57

Allan Cameron · Accepted Answer · 2023-03-04T16:58:12.890

Although there has been discussion around this before, I thought I would post an answer here that shows how to ensure that a particular proportion of points are included within each contour line.

To do this, we can get the 2d density with MASS::kde2d, then convert to a raster using terra. We can then order the points according to the density in the associated 2d density grid and find the density at which a quantile is passed with approx

density_quantiles <- function(x, y, quantiles) {
  dens <- MASS::kde2d(x, y, n = 500)
  df   <- cbind(expand.grid(x = dens$x, y = dens$y), z = c(dens$z))
  r    <- terra::rast(df)
  ind  <- sapply(seq_along(x), function(i) cellFromXY(r, cbind(x[i], y[i])))
  ind  <- ind[order(-r[ind][[1]])]
  vals <- r[ind][[1]]
  ret  <- approx(seq_along(ind)/length(ind), vals, xout = quantiles)$y
  replace(ret, is.na(ret), max(r[]))
}

This means that if we can specify a vector of quantiles:

quantiles <- c(0, 0.2, 0.4, 0.6, 0.8)

We can get an easy-to-interpret plot showing the areas enclosing 20%, 40%, 60% and 80% of our points like this:

ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quantiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quantiles', l
                       abels = scales::percent(quantiles[-1]),
                       direction = -1)

And quartiles would be like this:

quartiles <- c(0, 0.25, 0.5, 0.75)

ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quartiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quartiles', labels = scales::percent(quartiles[-1]),
                       direction = -1)

Note that this is very different from the levels in the ndensity plot, which are proportions of the maximum density, not areas containing a fixed proportion of points. In other words, if you looked sideways at a 3-D representation of the ndensity plot, the bands would all have equal height, as the following animation shows (see footnote for the code that produces this):

The code seems to give "strange" results when there are pockets of high density, such as this:

set.seed(123)
dat <-
  data.frame(
    X = c(rnorm(300, 3, 2.5), rnorm(150, 7, 2), rnorm(450, 4, 0.5)),
    Y = c(rnorm(300, 6, 2.5), rnorm(150, 2, 2), rnorm(450, 5, 0.5)))

ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quantiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quartiles', labels = scales::percent(quantiles[-1]),
                       direction = -1)

Where we can see that the 80th centile contains "islands", or non-contiguous areas. However, this is simply because we are still plotting density, and these are the areas above a threshold density value that contain the correct number of points. There is no guarantee that the density bands will be single contiguous areas, wherever one sets the thresholds.

We can see this clearly using a 3D plot of the quantiles version of density_quantiles, where little 'lumps' of low density break our thresholds here and there.

dens <- MASS::kde2d(dat$X, dat$Y, n = 1000)

levels <- as.character(cut(dens$z[-1, -1], 
                           breaks = density_quantiles(dat$X, dat$Y, quantiles),
                           labels = c(scales::viridis_pal()(4))))

persp(dens, col = levels, phi = 20, theta = -20, axes = FALSE, border = NA)

If instead you want the areas to be contiguous, then your problem becomes poorly defined. For example, it's not clear how the following density plot should look if you want contiguous areas:

set.seed(123)
dat <-
  data.frame(
    X = c(rnorm(150, 3, 2), rnorm(150, 10, 2)),
    Y = c(rnorm(150, 3, 2), rnorm(150, 10, 2)))

ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quartiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quartiles', labels = scales::percent(quartiles[-1]),
                       direction = -1)

In this example, to have contiguous areas, one would have to select one of these clusters as being the central point from which the number of dots is counted. This would give very artificial and misleading results which would not be reflective of density at all, but would have to be a distance function from the point of highest density, and would therefore always be a set of nested circles, like a bullseye.

Code for animation

library(magick)

p1 <- ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "ndensity",
    breaks = seq(0.25, 1, length.out = 4),
    show.legend = FALSE
  ) +
  geom_point(alpha = 0.1) +
  coord_fixed(0.95, expand = FALSE)  +
  theme(plot.margin = margin(75, 50, 60, 50))

ggsave('gg.png', p1, width = 480, height = 480, units = 'px', dpi = 72)

dens <- MASS::kde2d(dat$X, dat$Y, n = 1000)
df2 <- data.frame(x = dens$x, y = apply(dens$z, 1, max)/max(dens$z))

p2 <- ggplot(df2, aes(x, y)) +
  geom_area(fill = scales::viridis_pal()(3)[3]) +
  geom_area(fill = scales::viridis_pal()(3)[2],
            aes(y = ifelse(y > 0.75, 0.75, y))) +
  geom_area(fill = scales::viridis_pal()(3)[1],
            aes(y = ifelse(y > 0.5, 0.5, y))) +
  geom_area(fill = 'gray92',
            aes(y = ifelse(y > 0.25, 0.25, y))) +
  coord_fixed(diff(range(dens$x))) +
  geom_hline(yintercept = c(0, 0.25, 0.5, 0.75, 1)) +
  theme_classic() +
  theme(plot.margin = margin(58, 50, 50, 45))

ggsave('gg2.png', p2, width = 480, height = 480, units = 'px', dpi = 72)

levels <- as.character(cut(dens$z[-1, -1], 
                           breaks = c(0, 0.25, 0.5, 0.75, 1) * max(dens$z),
                           labels = c('gray92', scales::viridis_pal()(3))))

for(i in seq(0, 90, 3)) {
  ragg::agg_png(paste0("persp", sprintf("%02d", i), ".png"))
  persp(dens, col = levels, phi = i, d = 1000, axes = FALSE, 
        box = FALSE, border = NA)
  dev.off()
}

f <- list.files(pattern = 'persp\\d+\\.png', full.names = TRUE) 

c(c(rep(f[31], 10), rev(f), rep(f[1], 10),
    rep('gg2.png', 10), rep(f[1], 10), f, rep(f[31], 5), 
    rep('gg.png', 10))
  ) %>%
  rev(.) %>%
  image_read() %>% 
  image_join() %>% 
  image_animate(fps = 10) %>% 
  image_write("D:\\persp.gif")

Thanks for providing this interesting approach. Using this approach I get the following error: `Error in match(u, v, 0L) : 'match' requires vector arguments`. Do yo have any suggestion on this? — carl, Mar 03 '23 at 21:54
@carl yes - it should have been `terra::intersect` - now updated. — Allan Cameron, Mar 03 '23 at 22:38
This looks very good, but I have found that this approach is tied to the underlying data and fails or gives strange output when there are very dense areas in the chart. Failure occurs when we add the following parts into the data generation process for X and Y respectively: `rnorm(450, 4, 0.2)` and `rnorm(450, 5, 0.2)`. Strange output occurs by adding: `rnorm(450, 4, 0.5)` and `rnorm(450, 5, 0.5)`. — carl, Mar 04 '23 at 12:12
@carl I have simplified and speeded up the function so it now works on both your examples. As for the 'strange output', I think you mean the non-contiguous "islands" within some density bands. This is actually expected behaviour, and I have included an addendum that discusses why this happens and how it is unavoidable. If you want a different output, could you explain in a bit more detail how you would like this to work, including what should be the output in my "two cluster" example? Thanks. — Allan Cameron, Mar 04 '23 at 14:16
Thanks for the further elaborations. Regarding the 'strange output' i was particularly referring to quantiles rather than quartiles. In this case you will have not only the observation of non-contiguous 'islands' but also intersections between different quantiles within islands. — carl, Mar 04 '23 at 14:35
@carl but again, that is to be expected because we are dealing with density. This is just the shape of the 2D density surface. There are 'bumps' of higher density in those intersecting regions. If you didn't have these, whatever you are plotting, it isn't density. Are you perhaps able to give a clearer description of what you think it should look like? — Allan Cameron, Mar 04 '23 at 14:40
Indeed, I agree that this is expected given the densities and I assume useful in this context to have it incorporated in the example. Thanks — carl, Mar 04 '23 at 14:44
@Allan Cameron. Outstanding! Explicitly want to share this and thank you for this answer! — TarJae, Mar 10 '23 at 06:13

Interpretation of 2D density estimate charts

1 Answers1