Aggregated sum through time using dplyr and the tidyverse

Question

I'd like to compute aggregated values in a column through time (with "year" being another column in my data. I know how to do this easily in base R using a loop, but I feel there must be a way to do the same with dplyr using summarise in combination with something else. I would like to learn that, so I can integrate if in my code better.

I've made a toy example for the same case. Consider this data where we have the number of questions asked in Stackoverflow by someone each year.

> library(tidyverse)
> data <- tribble(~year, ~questions,
                2015,  1,
                2016,  3,
                2016,  2,
                2017,  2,
                2018,  3,
                2018,  7,
                2019, 10,
                2020, 21)
> data
# A tibble: 6 x 2
   year questions
  <dbl>     <dbl>
1  2015         1
2  2016         3
3  2016         2
4  2017         2
5  2018         3
6  2018         7
7  2019        10
8  2020        21

The following loop will do what I want

> for (i in 1:length(data$year)){
+   data$agg_questions[i] <- sum(data$questions[data$year <= data$year[I]])
+ }
> data
# A tibble: 6 x 3
   year questions agg_questions
  <dbl>     <dbl>         <dbl>
1  2015         1             1
2  2016         5             6
3  2017         2             8
4  2018        10            18
5  2019        10            28
6  2020        21            49

And, of course, I'm looking for a way that allows me not to use the loop. Not something like this:

> for (i in 1:length(data$year)){
+   data$agg_questions2[i] <- data %>% 
+   filter(year <= data$year[i]) %>% 
+   pull(questions) %>% 
+   sum()
+ } 
> data
# A tibble: 6 x 4
   year questions agg_questions agg_questions2
  <dbl>     <dbl>         <dbl>          <dbl>
1  2015         1             1              1
2  2016         5             6              6
3  2017         2             8              8
4  2018        10            18             18
5  2019        10            28             28
6  2020        21            49             49

I know if possible to use [] to subset inside the summarise() and mutate() functions, but I've always struggled with that. Thanks! Is that possible?

EDIT

After reading the first answers, I realised that I had simplified the example too much. I've edited the example data by adding several rows for the same year to make it look more like what I want (and, I think, complicates just using cumsum())

`data %>% arrange(year) %>% mutate(questions = cumsum(questions))` — Ronak Shah, Aug 07 '20 at 12:41

Eric · Accepted Answer · 2020-08-07T16:06:45.593

4

You can do this by using summarise and sum to create a year totals column and mutate with cumsum to create a column that provides cumulative sums over the years.

library(dplyr)

data <- tribble(~year, ~questions,
                2015,  1,
                2016,  3,
                2016,  2,
                2017,  2,
                2018,  3,
                2018,  7,
                2019, 10,
                2020, 21)

data %>%     
  group_by(year) %>%     
  summarise(year_total = sum(questions)) %>%    
  mutate(cum_over_years = cumsum(year_total))

#> # A tibble: 6 x 3
#>    year year_total cum_over_years
#>   <dbl>      <dbl>          <dbl>
#> 1  2015          1              1
#> 2  2016          5              6
#> 3  2017          2              8
#> 4  2018         10             18
#> 5  2019         10             28
#> 6  2020         21             49

edited Aug 07 '20 at 16:06

answered Aug 07 '20 at 13:08

Eric

2,699
5
17

Thanks for the answer @eric-fletcher, but it made me realise the example was too simple, with just one row per year. I want to apply to more complex cases. I'll edit and update the example – Javier Fajardo Aug 07 '20 at 14:47
1

They closed your question but just add one line to Eric's answer. `data %>% group_by(year) %>% summarise(year_total = sum(questions)) %>% mutate(cum_over_years = cumsum(year_total))` – Chuck P Aug 07 '20 at 15:55
Thank you Chuck P. I have updated my answer. – Eric Aug 07 '20 at 16:08
Thanks both, helpful and simple! I would add that on a real case scenario it's probably worth it to make sure you arrange by `year` before using `cumsum()`. That was my worry, `cumsum()` just sums all previous rows, but does not control order based on another column. I was looking for a function that was able to do that (sum + considering a second column to give the order), but it makes sense to get there using `summarise + arrange + mutate(cumsum())`. `data %>% group_by(year) %>% summarise(year_total = sum(questions)) %>% arrange(year) %>% mutate(cum_over_years = cumsum(year_total))` – Javier Fajardo Aug 07 '20 at 16:26

score 2 · Answer 2 · answered Aug 07 '20 at 13:17

library(tibble)

data <- tribble(~year, ~questions,
                2015,  1,
                2016,  3,
                2017,  2,
                2018,  3,
                2019, 10,
                2020, 21)

In base R:

data <- as.data.frame(data)
data$agg_questions <- cumsum(data$questions) 

> data
  year questions agg_questions
1 2015         1             1
2 2016         3             4
3 2017         2             6
4 2018         3             9
5 2019        10            19
6 2020        21            40

In data.table:

library(data.table)    
data <- as.data.table(data)
data[, agg_questions := cumsum(questions)]

> data
   year questions agg_questions
1: 2015         1             1
2: 2016         3             4
3: 2017         2             6
4: 2018         3             9
5: 2019        10            19
6: 2020        21            40

Aggregated sum through time using dplyr and the tidyverse

2 Answers2