2

I'm looking to work out a function to calculate the likelihood of a certain combination for B and R. The current illustration of the data looks like so:

ggplot(df, aes(R,B)) +
geom_bin2d(binwidth = c(1,1))

enter image description here

Is there a way to calculate the probabilities of each combination (e.g. R = 23, B = 30) based on these two discrete correlated variables that are positively skewed?

Could it be possible to use the stat_density_2d to solve or could there be a better way?

Thanks.

timnus
  • 197
  • 10
  • I was wavering between up and downvoting because 1) it is not exactly clear what you want - just calculation of the values or actually plotting? 2) You did not provide sample data. In the end I decided to upvote, because it made @JonSpring giving an interesting answer. – tjebo Dec 28 '19 at 09:57

2 Answers2

5

stat_density_2d uses MASS::kde2d under the hood. I imagine there are slicker ways to do this, but we can feed the data into that function and convert it into tidy data to get a smoothed version for that type of estimate.

First, some data like yours:

library(tidyverse)
set.seed(42)
df <- tibble(
  R = rlnorm(1E4, 0, 0.2) * 100,
  B = R * rnorm(1E4, 1, 0.2)
)

ggplot(df, aes(R,B)) +
  geom_bin2d(binwidth = c(1,1))

enter image description here

Here's running the density and converting into a tibble with the same coordinates as the data. (Are there better ways to do this?)

n = 201 # arbitrary grid size, chosen to be 1 more than the range below 
        #   so the breaks are at integers
smooth <- MASS::kde2d(df$R, df$B, lims = c(0, 200, 0, 200),
                      # h = c(20,20),  # could tweak bandwidth here 
                      n = n) 
df_smoothed <- smooth$z %>% 
  as_tibble() %>%
  pivot_longer(cols = everything(), names_to = "col", values_to = "val") %>% 
  mutate(R = rep(smooth$x, each = n), # EDIT: fixed, these were swapped
         B = rep(smooth$y, n))

df_smoothed now holds all the coordinates from 0:200 in the R and B dimensions, with the probability of each combination in the val column. These add up to 1, of nearly so (99.6% in this case). I think the remaining smidgen is the probabilities of coordinates outside the specified range.

sum(df_smoothed$val)
#[1] 0.9960702

The chances of any particular combination are just the density value at that point. So the chance of R = 70 and B = 100 would be 0.013%.

df_smoothed %>%
  filter(R == 70, B == 100)
## A tibble: 1 x 4
#  col        val     R     B
#  <chr>    <dbl> <int> <int>
#1 V101   0.0000345    70   100

The chance of R between 50-100 and B between 50-100 would be 36.9%:

df_smoothed %>%
  filter(R %>% between(50, 100),
         B %>% between(50, 100)) %>%
  summarize(total_val = sum(val))
## A tibble: 1 x 1
#total_val
#<dbl>
#  1     0.369

Here's how the smooth and the original data look together:

ggplot() +
  geom_tile(data = df_smoothed, aes(R, B, alpha = val), fill = "red") +
  geom_point(data = df %>% sample_n(500), aes(R, B), size = 0.2, alpha = 1/5)

enter image description here

tjebo
  • 21,977
  • 7
  • 58
  • 94
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • 1
    That is very interesting. I think the 'error' must somehow lie in the conversion of the matrix to the data frame? One way to interpret the OP is that they actually not want to plot the density estimates, but to retrieve predicted values for each coordinate - so could you possibly elaborate how you converted the predicted density matrix to coordinates? – tjebo Dec 28 '19 at 10:13
  • Thanks for the response @JonSpring, very much appreciated! I did some checks and when I expanded the range for `sum(df_smoothed$val)` it came up as 1. I also am finding the slope is different too for some reason. Thanks for providing the way to calculate from there, I'll try to work out why the gradient is so, but thanks for putting me down the right track! – timnus Dec 29 '19 at 04:31
  • Okay, so if you flip the `df$R` and `df$B` in `kde2d`, it looks like the shape we are after! – timnus Dec 29 '19 at 04:46
  • Thanks for the tip -- it looks like I goofed and swapped the `mutate(R = rep(smooth$x, each = n), B = rep(smooth$y, n))` parts and had the `rep` alternatives mixed up. That fixes the later plotting. – Jon Spring Dec 29 '19 at 06:28
1

If it's only about plotting, one could simply turn off the contours and use geom = raster like suggested in the ggplot2 reference.

Thanks to @JonSpring for the sample data!

library(tidyverse)

df <- tibble(
  R = rlnorm(1E4, 0, 0.2) * 100,
  B = R * rnorm(1E4, 1, 0.2)
)

ggplot(df, aes(R,B)) +
  stat_density2d(geom = 'raster', aes(fill = stat(density)), contour = FALSE) 

Created on 2019-12-28 by the reprex package (v0.3.0)

tjebo
  • 21,977
  • 7
  • 58
  • 94