I'm using ggplot geom_vline in combination with a custom function to plot certain values on top of a histogram.

The example function below returns a vector of three values: the mean, and the values a given number of standard deviations below and above the mean. I can pass these values to geom_vline(xintercept = ...) and see them in my graph.

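# example function
library(tidyverse)  # provides tibble, tidyr, dplyr, ggplot2 and the %>% pipe

sds_around_the_mean <- function(x, multiplier = 1) {
  # local names chosen so they do not shadow mean() and sd()
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)

  # returns an unnamed numeric vector: low, mean, high
  tibble(low  = m - multiplier * s,
         mean = m,
         high = m + multiplier * s) %>%
    pivot_longer(cols = everything()) %>%
    pull(value)
}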

Reproducible data

# data
set.seed(123)
normal <- tibble(data = rnorm(1000, mean = 100, sd = 5))
outliers <- tibble(data = runif(5, min = 150, max = 200))

df <- bind_rows(lst(normal, outliers), .id = "type")

df %>% 
  ggplot(aes(x = data)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 3),
             linetype = "dashed", color = "red") + 
  geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 2),
             linetype = "dashed")

[Figure: example_hist]

The problem is that, as you can see, I have to reference df$data in several places. This becomes error-prone as soon as I change the original df that I pipe into ggplot, e.g. by filtering out outliers before plotting, because I have to repeat the same change in every one of those places.

E.g.
df %>% filter(type == "normal")
#also requires 
df$data 
#to be changed to 
df$data[df$type == "normal"] 
#in geom_vline to obtain the correct input values for the xintercept.

So instead, how could I replace the df$data argument with the respective column of whatever has been piped into ggplot() in the first place? Something similar to the "." operator, I assume. I've also tried stat_summary with geom = "vline" to achieve this, but without the desired effect.

Rasul89

1 Answer

You can enclose the ggplot part in curly brackets and reference the incoming dataset with the . symbol, both in the ggplot() call and when calling sds_around_the_mean(). The lines are then always computed from whatever data frame was piped in, including any changes applied upstream.

df %>% 
  {ggplot(data = ., aes(x = data)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 3),
             linetype = "dashed", color = "red") + 
  geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 2),
             linetype = "dashed")}
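
As a rough sketch of how this plays out with the filtering case from the question: the filter is written once, and the dot inside the braces refers to the already filtered data. To keep the example short, the helper here is a simplified base-R stand-in for the original function (an assumption, it returns the same three values), and the variable names p, piped and by_hand are only illustrative.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Simplified stand-in for the question's helper (assumption):
# returns c(low, mean, high) without the tibble/pivot_longer round trip.
sds_around_the_mean <- function(x, multiplier = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(m - multiplier * s, m, m + multiplier * s)
}

# Same data as in the question
set.seed(123)
normal   <- tibble(data = rnorm(1000, mean = 100, sd = 5))
outliers <- tibble(data = runif(5, min = 150, max = 200))
df <- bind_rows(lst(normal, outliers), .id = "type")

# The filter is applied once; `.` inside the braces is the filtered data
p <- df %>%
  filter(type == "normal") %>%
  {
    ggplot(data = ., aes(x = data)) +
      geom_histogram(bins = 100) +
      geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 3),
                 linetype = "dashed", color = "red") +
      geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 2),
                 linetype = "dashed")
  }

# The piped values match the manual subsetting from the question
piped   <- df %>% filter(type == "normal") %>% { sds_around_the_mean(.$data, 3) }
by_hand <- sds_around_the_mean(df$data[df$type == "normal"], 3)
```

Printing p draws the histogram; piped and by_hand should agree, which is the point of referencing the dot instead of df$data.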
pieterbons
  • Thank you, this was easier than expected. What is the logic behind the curly brackets here? Some form of tidy evaluation? Does this explicitly make the data argument dynamic or could you apply this to every dplyr pipeline? – Rasul89 Jun 28 '22 at 11:45
  • The curly brackets suppress the default behavior of automatically inserting the lhs (the input from the pipe) as the first argument of the rhs call, which makes it easier to use the dot placeholder several times. (summarised from the magrittr documentation available via ?magrittr::`%>%`) – pieterbons Jun 28 '22 at 12:38
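
A minimal illustration of the behavior described in that comment, independent of ggplot and using a made-up vector x: when the dot only appears inside a nested call, magrittr still inserts the lhs as the first argument, and braces switch that off.

```r
library(magrittr)

x <- c(1, 2, 3)

# No braces: the dot is only used inside the nested sum() call, so x is
# also inserted as the first argument. This is evaluated as c(x, sum(x)).
without_braces <- x %>% c(sum(.))

# Braces: the lhs is not inserted automatically and is only used where
# the dot appears. This is evaluated as c(sum(x)).
with_braces <- x %>% { c(sum(.)) }

print(without_braces)
print(with_braces)
```

This applies to any magrittr pipeline, not just to the data argument of ggplot().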