I have a really long dataframe, with more or less the following structure:
df <- data.frame(
dates = c("2011-10-01","2011-10-01","2011-10-01","2011-10-02","2011-10-03","2011-10-05","2011-10-06","2011-10-06"),
ids = c("A","A","B","C","D","A","E","D"),
values = c(10,1,25,2,5,10,4,1))
> df
dates ids values
1 2011-10-01 A 10
2 2011-10-01 A 1
3 2011-10-01 B 25
4 2011-10-02 C 2
5 2011-10-03 D 5
6 2011-10-05 A 10
7 2011-10-06 E 4
8 2011-10-06 D 1
I would like to get the following output:
dates unique_ids sum_values
1 2011-10-01 2 36
2 2011-10-02 3 38
3 2011-10-03 4 43
4 2011-10-04 4 43
5 2011-10-05 4 53
6 2011-10-06 5 58
i.e. for each date unique_ids gives the number of unique ids corresponding to earlier dates and sum_values gives the sum of values corresponding to earlier dates.
I definitely want to avoid for cycles because the original df too big. So I was thinking to use dplyr.
I know how to obtain the sum_value
df %>%
group_by(dates) %>%
summarize(sum_values_daily = sum(values)) %>%
mutate(sum_values = cumsum(sum_values_daily)) %>%
select(dates, sum_values)
I don't know how to obtains the unique_ids column.
Any Idea?