I'd like to compute aggregated values in a column through time (with "year" being another column in my data.
I know how to do this easily in base R using a loop, but I feel there must be a way to do the same with dplyr
using summarise in combination with something else. I would like to learn that, so I can integrate if in my code better.
I've made a toy example for the same case. Consider this data where we have the number of questions asked in Stackoverflow by someone each year.
> library(tidyverse)
> data <- tribble(~year, ~questions,
2015, 1,
2016, 3,
2016, 2,
2017, 2,
2018, 3,
2018, 7,
2019, 10,
2020, 21)
> data
# A tibble: 6 x 2
year questions
<dbl> <dbl>
1 2015 1
2 2016 3
3 2016 2
4 2017 2
5 2018 3
6 2018 7
7 2019 10
8 2020 21
The following loop will do what I want
> for (i in 1:length(data$year)){
+ data$agg_questions[i] <- sum(data$questions[data$year <= data$year[I]])
+ }
> data
# A tibble: 6 x 3
year questions agg_questions
<dbl> <dbl> <dbl>
1 2015 1 1
2 2016 5 6
3 2017 2 8
4 2018 10 18
5 2019 10 28
6 2020 21 49
And, of course, I'm looking for a way that allows me not to use the loop. Not something like this:
> for (i in 1:length(data$year)){
+ data$agg_questions2[i] <- data %>%
+ filter(year <= data$year[i]) %>%
+ pull(questions) %>%
+ sum()
+ }
> data
# A tibble: 6 x 4
year questions agg_questions agg_questions2
<dbl> <dbl> <dbl> <dbl>
1 2015 1 1 1
2 2016 5 6 6
3 2017 2 8 8
4 2018 10 18 18
5 2019 10 28 28
6 2020 21 49 49
I know if possible to use []
to subset inside the summarise() and mutate() functions, but I've always struggled with that. Thanks!
Is that possible?
EDIT
After reading the first answers, I realised that I had simplified the example too much. I've edited the example data by adding several rows for the same year to make it look more like what I want (and, I think, complicates just using cumsum()
)