How can I optimize the dplyr code by group if all calculations are the same

Question

I have the following data frame, which is a subset of a much larger one containing over 3 million rows.

df <- data.frame(Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
              SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
              Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
              Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
              Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1))

Per Group, there are five different formulas that I want to calculate. 1) A sum of Value by SubGroup 2) A sum of Value by SubGroup and Dup 3) A sum of Value by SubGroup and Type Z 4) A sum of Value by SubGroup and Type Y 5) A sum of Value by SubGroup and Type X

Here is the code that I currently have using dplyr:

result <- df %>%
 group_by(Group) %>%
 summarise(SubGroup.A.Total = sum(Value[SubGroup=="A"]),
        SubGroup.A.Dup = sum(Value[SubGroup=="A" & Dup==1]),
        SubGroup.A.TypeZ = sum(Value[SubGroup=="A" & Type=="Z"]),
        SubGroup.A.TypeY = sum(Value[SubGroup=="A" & Type=="Y"]),
        SubGroup.A.TypeX = sum(Value[SubGroup=="A" & Type=="X"]),
        SubGroup.B.Total = sum(Value[SubGroup=="B"]),
        SubGroup.B.Dup = sum(Value[SubGroup=="B" & Dup==1]),
        SubGroup.B.TypeZ = sum(Value[SubGroup=="B" & Type=="Z"]),
        SubGroup.B.TypeY = sum(Value[SubGroup=="B" & Type=="Y"]),
        SubGroup.B.TypeX = sum(Value[SubGroup=="B" & Type=="X"]),
        SubGroup.C.Total = sum(Value[SubGroup=="C"]),
        SubGroup.C.Dup = sum(Value[SubGroup=="C" & Dup==1]),
        SubGroup.C.TypeZ = sum(Value[SubGroup=="C" & Type=="Z"]),
        SubGroup.C.TypeY = sum(Value[SubGroup=="C" & Type=="Y"]),
        SubGroup.C.TypeX = sum(Value[SubGroup=="C" & Type=="X"]))

However, I really want to know if there is a more efficient way of doing this, both from a run time and number of lines perspective? Since I am in essence looping the same five formulas by SubGroup, I imagine there is a much simpler way of writing this.

score 1 · Answer 1 · answered Aug 01 '19 at 05:50

This answer uses pivot_wider, which so far is only available in the dev version of tidyr.

You are grouping by SubGroup for all the variables you want to calculate, so it's worth adding that to the group_by. For some of the variables you're also grouping by Type, so I guess you could do a separate grouping by Group, SubGroup, Type and join those variables back in, but it's not clear if that's worth it.

Sticking to just group_by(Group, SubGroup), you can do:

df %>%
    group_by(Group, SubGroup) %>%
    summarize(
        Total = sum(Value),
        Dup = sum(Value[Dup == 1]),
        X = sum(Value[Type == "X"]),
        Y = sum(Value[Type == "Y"]),
        Z = sum(Value[Type == "Z"])
    ) %>%
    pivot_wider(
        names_from = SubGroup,
        values_from = c(-Group, -SubGroup)
    )

score 1 · Answer 2 · answered Aug 01 '19 at 10:37

Similar to @Marius 's answer:

do.call(
  rbind,
  by(df, list(df$Group, df$SubGroup), FUN = function(x) {
    within(x, {
      value_sum <- sum(Value)
      value_sum_dup1 <- sum(Value[Dup == 1])
      value_sum_dup0 <- sum(Value[Dup == 0])
      value_z <- sum(Value[Type == "Z"])
      value_y <- sum(Value[Type == "Y"])
      value_x <- sum(Value[Type == "Z"])
    })[1, ]
  })
)

How can I optimize the dplyr code by group if all calculations are the same

2 Answers2