R: making group_by and summarise more efficient

Question

I've got a data set called data with column headers Region, 2006, 2007, and so on until 2020. The region column gives the name of the area, while the year columns give the population for that year. For example 2006 lists the population for that year in a particular region, 2007 lists the population for that year in a particular region and so on.

The below code gives me my desired output (it shows the total population for each year by region). However, it is very time consuming to type this code out. Is there a way to make this code more efficient and save time typing out 15 different lines?

newData <- data %>%
  group_by(Region) %>%
  summarise(totalPop2006 = sum(`2006`, na.rm = TRUE), 
            totalPop2007 = sum(`2007`, na.rm = TRUE),
            totalPop2008 = sum(`2008`, na.rm = TRUE),
            totalPop2009 = sum(`2009`, na.rm = TRUE),
            totalPop2010 = sum(`2010`, na.rm = TRUE),
            totalPop2011 = sum(`2011`, na.rm = TRUE),
            totalPop2012 = sum(`2012`, na.rm = TRUE),
            totalPop2013 = sum(`2013`, na.rm = TRUE),
            totalPop2014 = sum(`2014`, na.rm = TRUE),
            totalPop2015 = sum(`2015`, na.rm = TRUE),
            totalPop2016 = sum(`2016`, na.rm = TRUE),
            totalPop2017 = sum(`2017`, na.rm = TRUE),
            totalPop2018 = sum(`2018`, na.rm = TRUE),
            totalPop2019 = sum(`2019`, na.rm = TRUE),
            totalPop2020 = sum(`2020`, na.rm = TRUE)       
            ) %>%
  ungroup() %>%
  arrange(Region)

Thanks!

akrun · Answer 1 · 2020-09-09T07:05:23.210

2

We can use summarise with across

library(dplyr)
data %>%
   group_by(Region) %>% 
   summarise(across(`2006`:`2020`, ~ sum(., na.rm = TRUE),
             .names = 'totalPop{col}'), .groups = 'drop') %>%
   arrange(Region)

Using the default dataset 'mtcars'

data(mtcars)
mtcars %>%
     group_by(cyl) %>%
     summarise(across(disp:wt, ~ sum(., na.rm = TRUE), .names = 'totalPop{col}'), 
          .groups = 'drop')
# A tibble: 3 x 5
#    cyl totalPopdisp totalPophp totalPopdrat totalPopwt
#  <dbl>        <dbl>      <dbl>        <dbl>      <dbl>
#1     4        1156.        909         44.8       25.1
#2     6        1283.        856         25.1       21.8
#3     8        4943.       2929         45.2       56.0

Or in base R with aggregate

aggregate(. ~ Region, data[c('Region', 2006:2020)],
         sum, na.rm = TRUE, na.action = NULL)

edited Sep 09 '20 at 07:05

answered Sep 09 '20 at 06:34

akrun

874,273
37
540
662

Do we need `{col}` in `.names`? I have not learned this way yet. Thanks for showing it. – jazzurro Sep 09 '20 at 07:14
1

@jazzurro By default, the `across` returns the orignal column names. Here, the OP may require a modification of column name. The prefix 'totalPop' is added along with {col} which is the orignal column name. By using the `{col}` we can modify either prefix or suffix or can have both by placing the {col} in middle – akrun Sep 09 '20 at 07:16

R: making group_by and summarise more efficient

1 Answers1