1

I am trying to use R's dplyr package to create multiple new columns for each year in my dataset that is the sum of the columns corresponding to each year's end of quarter figures (Mar, Jun, Sep, Dec). The only way I have been able to figure out how to do this "efficiently" is with a for loop. But something tells me that there's an alternative, more efficient, or better way of going about this (maybe I should be using a map function here, but I'm just not sure?). Here's a toy example that can be reproduced:

library(tidyverse)
library(glue)

# Create a toy example and print the resulting tibble
set.seed(100) # make results reproducible by setting seed
vars <- c("AgeGroup", paste0(month.abb[seq(3, 12, 3)], "_", rep(15:17, each = 4)))

(df <- cbind(LETTERS[1:5], matrix(rpois(n = (length(vars) - 1) * 5, 30), nrow = 5)) %>% 
    data.frame() %>%
    setNames(vars) %>% 
    tibble() %>% 
    mutate(across(-1, as.integer))
  )

which sets up the example/reproducible dataset as:

# A tibble: 5 × 13
  AgeGroup Mar_15 Jun_15 Sep_15 Dec_15 Mar_16 Jun_16 Sep_16 Dec_16 Mar_17 Jun_17 Sep_17 Dec_17
  <chr>     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 A            27     26     33     36     34     25     27     37     37     32     37     30
2 B            21     32     24     31     25     39     32     20     30     32     25     26
3 C            34     28     30     23     25     29     35     26     19     30     28     29
4 D            30     32     29     34     31     29     35     37     28     34     31     50
5 E            31     33     27     31     23     26     29     28     28     26     19     37

So what I would like to do is create one new variable for each year ('15, '16, and '17) called sum_15, sum_16, and sum_17 which are the sum of all the months' values from variables that end with the corresponding two digit year (e.g. ends_with("15"), ends_with("16"), ends_with("17")).

I have been able to achieve the desired result with the following code, but I'd rather not use a loop, if I can get away with the judicious application of an across statement or perhaps a map function (or some other method you all might suggest):

# This works, but I'd rather not use a for loop if I can avoid it:
for (i in 15:17) {
  df <- df %>% mutate("sum_{i}" := rowSums(across(ends_with(glue("_{i}")))))
}

#write out the df that displays what I am trying to achieve
df %>% select(AgeGroup, starts_with("sum"))

# A tibble: 5 × 4
  AgeGroup sum_15 sum_16 sum_17
  <chr>     <dbl>  <dbl>  <dbl>
1 A           122    123    136
2 B           108    116    113
3 C           115    115    106
4 D           125    132    143
5 E           122    106    110

I looked through other examples on SO, but all of the examples I have found are overly simplistic and seem to create only a single variable at a time by manually creating them in the mutate statement—something along the lines of:

df %>% mutate(sum15 = rowSums(across(ends_with("_15"))),
              sum16 = rowSums(across(ends_with("_16"))),
              sum17 = rowSums(across(ends_with("_17"))),
              )

This is obviously not what I'm looking for since this is basically a more manual way of doing what I'm already doing with the for loop.

Could anyone offer any suggestions on how to improve this code and avoid the for loop?

Thank you so much!

StatsStudent
  • 1,384
  • 2
  • 10
  • 28
  • 2
    Pivot longer and then group by and summarize. – Michael Dewar Aug 14 '23 at 03:09
  • 1
    There are so many excellent answers to this question, I'm torn on which one to accept as an answer. I wish I could accept them all! I think in the end, I'm going to select @Onyambu's answer since it provides several options and I like it's simplicity. Thank you to everyone who provided a response. I wish I could award you all an "accepted answer." – StatsStudent Aug 14 '23 at 15:58

5 Answers5

5

Another way would be:

df %>%
   pivot_longer(-AgeGroup, names_pattern = "(\\d+)")%>%
   pivot_wider(values_fn = sum, names_prefix = 'Sum_')

# A tibble: 5 × 4
  AgeGroup Sum_15 Sum_16 Sum_17
  <chr>     <int>  <int>  <int>
1 A           122    123    136
2 B           108    116    113
3 C           115    115    106
4 D           125    132    143
5 E           122    106    110

And then you can join to the original df


If you do not have an idea about the names_pattern you can use names_sep:

df %>%
   pivot_longer(-AgeGroup, names_to = c(NA, 'name'), names_sep = "_")%>%
   pivot_wider(values_fn = sum, names_prefix = 'Sum_')
# A tibble: 5 × 4
  AgeGroup Sum_15 Sum_16 Sum_17
  <chr>     <int>  <int>  <int>
1 A           122    123    136
2 B           108    116    113
3 C           115    115    106
4 D           125    132    143
5 E           122    106    110

in base R you could do:

sapply(split.default(df[-1], sub(".*_", "Sum_", names(df)[-1])), rowSums)
     Sum_15 Sum_16 Sum_17
[1,]    122    123    136
[2,]    108    116    113
[3,]    115    115    106
[4,]    125    132    143
[5,]    122    106    110

Which you could cbind to the original dataframe ie

cbind(df, sapply(split.default(df[-1], sub(".*_", "Sum_", names(df)[-1])), rowSums))
Onyambu
  • 67,392
  • 3
  • 24
  • 53
2

One possibility:

bind_cols(df,
  df %>%
    pivot_longer(-AgeGroup) %>%
    count(AgeGroup, yr = paste0("sum", str_sub(name, start = 4)), wt = value) %>% 
    pivot_wider(names_from = yr, values_from = n) %>%
    select(-AgeGroup)
  )


# A tibble: 5 × 16
  AgeGroup Mar_15 Jun_15 Sep_15 Dec_15 Mar_16 Jun_16 Sep_16 Dec_16 Mar_17 Jun_17 Sep_17 Dec_17 sum_15 sum_16 sum_17
  <chr>     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 A            27     26     33     36     34     25     27     37     37     32     37     30    122    123    136
2 B            21     32     24     31     25     39     32     20     30     32     25     26    108    116    113
3 C            34     28     30     23     25     29     35     26     19     30     28     29    115    115    106
4 D            30     32     29     34     31     29     35     37     28     34     31     50    125    132    143
5 E            31     33     27     31     23     26     29     28     28     26     19     37    122    106    110
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
1

Not a tidyverse answer but you can use lapply and cbind

years <- gsub("(.+_)(\\d+)", replacement = "\\2", names(df)[grepl("[0-9]", names(df))]) |> 
  unique()

cbind(df[1], do.call(cbind, lapply(setNames(years, paste0("sum_", years)), \(x){
   df[grepl(x, names(df))] |> 
     rowSums()
})))


AgeGroup sum_15 sum_16 sum_17
1        A    122    123    136
2        B    108    116    113
3        C    115    115    106
4        D    125    132    143
5        E    122    106    110

Just James
  • 1,222
  • 2
  • 7
1

Just for a variation on the previous answer. This one also uses pivot_longer but (1) splits variable names into month and year using pivot_longer arguments, (2) creates the counts by group summary on AgeGroup and year and (3) uses a join back to the original data, rather than binding columns.

library(dplyr)

df %>% 
  left_join(pivot_longer(df, 
                         -AgeGroup, 
                         names_to = c("month", "year"), 
                         names_sep = "_") %>% 
  group_by(AgeGroup, year) %>% 
  summarise(sum = sum(value)) %>% 
  pivot_wider(names_from = "year", 
              names_prefix = "sum_",
              values_from = "sum"))

Result:

# A tibble: 5 × 16
  AgeGroup Mar_15 Jun_15 Sep_15 Dec_15 Mar_16 Jun_16 Sep_16 Dec_16 Mar_17 Jun_17     Sep_17 Dec_17 sum_15 sum_16 sum_17
  <chr>     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 A            27     26     33     36     34     25     27     37     37     32     37     30    122    123    136
2 B            21     32     24     31     25     39     32     20     30     32     25     26    108    116    113
3 C            34     28     30     23     25     29     35     26     19     30     28     29    115    115    106
4 D            30     32     29     34     31     29     35     37     28     34     31     50    125    132    143
5 E            31     33     27     31     23     26     29     28     28     26     19     37    122    106    110
neilfws
  • 32,751
  • 5
  • 50
  • 63
1

Alternatively please try the below code with sundf custom function and map

sumdf <- \(x){
  new <- paste0('sum',x)
  df %>% 
    mutate(!!new :=rowSums(across(ends_with(x)))) %>% select(!!new)
}

new <- map(c('15','16','17'), \(x) sumdf(x))

do.call(cbind,list(df,new))


  AgeGroup Mar_15 Jun_15 Sep_15 Dec_15 Mar_16 Jun_16 Sep_16 Dec_16
1        A     27     26     33     36     34     25     27     37
2        B     21     32     24     31     25     39     32     20
3        C     34     28     30     23     25     29     35     26
4        D     30     32     29     34     31     29     35     37
5        E     31     33     27     31     23     26     29     28
  Mar_17 Jun_17 Sep_17 Dec_17 sum15 sum16 sum17
1     37     32     37     30   122   123   136
2     30     32     25     26   108   116   113
3     19     30     28     29   115   115   106
4     28     34     31     50   125   132   143
5     28     26     19     37   122   106   110

jkatam
  • 2,691
  • 1
  • 4
  • 12