I'm trying to summarise multiple columns in a data frame using dplyr's group_by/summarise. If there is a dependency on an earlier column in one of the later columns, summarise uses the already summarised values. Is there a way to avoid this behaviour and use the original values?
I can of course reorder the way I summarise, or give the summarised column with dependencies a new name and rename it later. However, the behaviour is somewhat unexpected, so I was wondering whether there is a way to avoid it. I have the latest version of dplyr (version 0.8.0.1).
library(dplyr)
# Create data frame with data and group column
df <- data.frame(observation = rnorm(5000),
                 group = rep(1:1000, each = 5))
# Summarise to mean observation --> Standard deviation is NA
df %>%
  group_by(group) %>%
  summarise(observation = mean(observation), std = sd(observation)) %>%
  View
# Possible solution: rename variable --> Standard deviation is calculated
df %>%
  group_by(group) %>%
  summarise(observation_mean = mean(observation), std = sd(observation)) %>%
  rename(observation = observation_mean) %>%
  View
In the first group_by/summarise, no standard deviation is calculated, because dplyr works with the already updated value of observation, which is only a single value per group, so sd() returns NA. In the second group_by/summarise, the original observations are still available and the standard deviation is calculated as expected.
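For completeness, the reordering workaround I mentioned would look roughly like the sketch below: std is computed first, while observation still refers to the original column, and the mean overwrites observation afterwards. The set.seed() call and the res variable are only there to make the example reproducible and are not part of my original code.

```r
library(dplyr)

# set.seed() is added here only for reproducibility
set.seed(1)
df <- data.frame(observation = rnorm(5000),
                 group = rep(1:1000, each = 5))

# Compute std before observation is overwritten by its mean
res <- df %>%
  group_by(group) %>%
  summarise(std = sd(observation), observation = mean(observation))
```

This gives a non-NA std for every group, but the columns come out in the order group, std, observation, and the result silently depends on the order of the arguments, which is exactly what I would like to avoid.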