
I'm trying to summarise multiple columns in a data frame using dplyr's group_by/summarise. If a later expression in summarise depends on a column that an earlier expression has already overwritten, summarise uses the already summarised values. Is there a way to avoid this behaviour and use the original values?

I can of course reorder the expressions in summarise, or give the summarised column that others depend on a new name and rename it later. However, the behaviour is somewhat unexpected, so I was wondering if there is a way to avoid it. I have the latest version of dplyr (version 0.8.0.1).

library(dplyr)

# Create data frame with data and group column
df <- data.frame(observation = rnorm(5000), 
                 group = rep(1:1000, each = 5))

# Summarise to mean observation --> Standard deviation is NA
df %>% 
  group_by(group) %>% 
  summarise(observation = mean(observation), std = sd(observation)) %>% 
  View

# Possible solution: rename variable --> Standard deviation is calculated
df %>% 
  group_by(group) %>% 
  summarise(observation_mean = mean(observation), std = sd(observation)) %>% 
  rename(observation = observation_mean) %>% 
  View

In the first group_by/summarise, no standard deviation is calculated, because dplyr works with the already updated observation column, which holds only a single value (the mean) per group, so sd() returns NA. In the second group_by/summarise, the original observations are still available and the standard deviation is calculated as expected.
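
For reference, the reordering workaround mentioned above might look like the following minimal sketch. It assumes that summarise evaluates its expressions from left to right, as the behaviour described here suggests; set.seed(), the result name res and the final select() are only there for illustration.

```r
library(dplyr)

set.seed(1)  # illustrative only, makes the example reproducible

df <- data.frame(observation = rnorm(5000),
                 group = rep(1:1000, each = 5))

# Compute everything that needs the raw observations first,
# and overwrite `observation` with its mean last.
res <- df %>%
  group_by(group) %>%
  summarise(std = sd(observation),
            observation = mean(observation)) %>%
  select(group, observation, std)  # restore the intended column order
```

This keeps the original column name without a rename step, at the cost of having to remember that the overwriting expression must come last.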

Tom
  • I think renaming is the way to go; otherwise you overwrite your variable and then try to calculate the standard deviation of a single number, which will not work. – Esben Eickhardt Apr 24 '19 at 08:14
  • Friendly tip: `lapply(1:1000, rep,5) %>% unlist` can be simplified to `rep(1:1000, each = 5)`. – s_baldur Apr 24 '19 at 08:20
  • Possible duplicate of [Using dplyr to summarize and keep the same variable name](https://stackoverflow.com/questions/48357867/using-dplyr-to-summarize-and-keep-the-same-variable-name) – Lennyy Apr 24 '19 at 09:05
