2

I have a really long dataframe, with more or less the following structure:

df <- data.frame(
  dates = c("2011-10-01","2011-10-01","2011-10-01","2011-10-02","2011-10-03","2011-10-05","2011-10-06","2011-10-06"),
  ids = c("A","A","B","C","D","A","E","D"),
  values = c(10,1,25,2,5,10,4,1))

> df
       dates ids values
1 2011-10-01   A     10
2 2011-10-01   A      1
3 2011-10-01   B     25
4 2011-10-02   C      2
5 2011-10-03   D      5
6 2011-10-05   A     10
7 2011-10-06   E      4
8 2011-10-06   D      1

I would like to get the following output:

       dates   unique_ids sum_values
1 2011-10-01            2         36
2 2011-10-02            3         38
3 2011-10-03            4         43
4 2011-10-04            4         43
5 2011-10-05            4         53
6 2011-10-06            5         58

i.e. for each date, unique_ids gives the number of unique ids seen up to and including that date, and sum_values gives the sum of the values up to and including that date.

I definitely want to avoid for loops because the original df is too big, so I was thinking of using dplyr.

I know how to obtain the sum_values column:

df %>%
  group_by(dates) %>%
  summarize(sum_values_daily = sum(values)) %>%
  mutate(sum_values = cumsum(sum_values_daily)) %>%
  select(dates, sum_values)

I don't know how to obtain the unique_ids column.

Any ideas?

gico
  • `df %>% group_by(dates) %>% summarise(unique_ids = n_distinct(ids), sum_values = sum(values))` – Ronak Shah Apr 14 '17 at 11:13
  • Hi Ronak, what you suggest is not the desired result. For a certain date I want unique_ids to be the number of unique ids up to and including that date, and likewise sum_values to be the sum of the values up to and including that date. – gico Apr 14 '17 at 11:17
  • Please double-check your expected output, I don't think it's correct. – mtoto Apr 14 '17 at 11:44
  • mtoto I think it is correct but not clearly explained. The unique number of ids is 2 and, separately, the sum of all values for 2011-10-01 is 36; the values are not being summed only for the unique ids. – sconfluentus Apr 14 '17 at 11:49
  • @mtoto the desired output is correct. I do not want simply `df %>% group_by(dates) %>% summarize(unique_ids = n_distinct(ids), sum_values = sum(values))` – gico Apr 14 '17 at 11:53
  • There should be one row for `2011-10-05` as we group by date. – mt1022 Apr 14 '17 at 11:55
  • Right, sorry for the mistake. – gico Apr 14 '17 at 11:58

3 Answers

2

As an alternative, here is a data.table solution. For clarity, I'll present it in three steps, though these could be chained into a single expression.

library(data.table)
# convert to data.table and make dates a Date column
setDT(df)[, "dates" := as.Date(dates)]
# join against the full daily date sequence (2011-10-04 is missing from the original data),
# convert NAs to 0 for those missing dates, and calculate cumulative sums of unique ids and values
df <- df[.(seq.Date(min(dates), max(dates), by="day")), on="dates",
         .(dates, values=cumsum(ifelse(is.na(values), 0, values)),
           unique_ids=cumsum(!duplicated(ids) & !is.na(ids)))]
# aggregate by date, saving the max of unique ID and value
df <- df[, .(unique_ids=max(unique_ids), sum_values=max(values)), by=dates]

This returns

df 
        dates unique_ids sum_values
1: 2011-10-01          2         36
2: 2011-10-02          3         38
3: 2011-10-03          4         43
4: 2011-10-04          4         43
5: 2011-10-05          4         53
6: 2011-10-06          5         58
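
For reference, the last two steps can be chained into a single expression once dates has been converted, using the same code as above:

library(data.table)
setDT(df)[, "dates" := as.Date(dates)]
df <- df[.(seq.Date(min(dates), max(dates), by = "day")), on = "dates",
         .(dates, values = cumsum(ifelse(is.na(values), 0, values)),
           unique_ids = cumsum(!duplicated(ids) & !is.na(ids)))
         ][, .(unique_ids = max(unique_ids), sum_values = max(values)), by = dates]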
lmo
  • @mtoto Thanks for catching the other NA... On your second point, you are partially correct. When using `:=`, for example as in `setDT(df)[, "dates" := as.Date(dates)]` above, data.table assigns by reference. However, in most other operations it makes a copy (another exception is `set`). To see this, run the first two lines of my code without the `df <-` in the second line, then print `df` in the console. You will notice that 10-04 is missing. – lmo Apr 14 '17 at 13:23
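
A minimal toy example of the distinction described in the comment above (illustrative only, not the answer's data):

library(data.table)

dt <- data.table(x = 1:3)
dt[, y := x * 2]    # `:=` adds column y to dt by reference, no assignment needed
dt2 <- dt[x > 1]    # an ordinary query like this returns a new data.table (a copy)
dt                  # dt still has its 3 rows, now with the extra y column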
1

Because you are trying to calculate the number of distinct ids across groups, first we'll need to define a boolean column that flags the first occurrence of each id, so that summing it counts only new ids.
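
For instance, on the ids column of the example data, !duplicated() marks only the first time each id appears:

ids <- c("A", "A", "B", "C", "D", "A", "E", "D")
!duplicated(ids)
# [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE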

Secondly, you want to include missing dates from your original df in your expected output, so we'll also need to perform a right_join with the full sequence of dates. I assume here that your dates column is already of class Date. This will produce NA values that we replace by 0.
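
With the example data, that full sequence of dates (already of class Date) is simply:

seq(as.Date("2011-10-01"), as.Date("2011-10-06"), by = 1)
# [1] "2011-10-01" "2011-10-02" "2011-10-03" "2011-10-04" "2011-10-05" "2011-10-06"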

Finally we calculate the cumsum for both unique_ids and sum_values.

library(dplyr)

df %>% mutate(unique_ids = !duplicated(ids)) %>%
        group_by(dates) %>%
        summarise(unique_ids = sum(unique_ids),
                  sum_values = sum(values)) %>%
        right_join(data.frame(dates = seq(min(df$dates), 
                                          max(df$dates), 
                                          by = 1))) %>%
        mutate_each(funs(replace(., is.na(.), 0)), -dates)  %>%
        mutate_each(funs(cumsum), -dates)
#       dates unique_ids sum_values
#      <date>      <dbl>      <dbl>
#1 2011-10-01          2         36
#2 2011-10-02          3         38
#3 2011-10-03          4         43
#4 2011-10-04          4         43
#5 2011-10-05          4         53
#6 2011-10-06          5         58
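
As a side note for newer dplyr versions, mutate_each() and funs() have since been superseded; a roughly equivalent sketch using across() (again assuming dates is already of class Date) could look like:

library(dplyr)

df %>%
  mutate(unique_ids = !duplicated(ids)) %>%
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids),
            sum_values = sum(values)) %>%
  right_join(data.frame(dates = seq(min(df$dates), max(df$dates), by = 1)),
             by = "dates") %>%
  arrange(dates) %>%                                   # keep rows in date order before the cumulative step
  mutate(across(-dates, ~ replace(.x, is.na(.x), 0))) %>%
  mutate(across(-dates, cumsum))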
mtoto
1

library(dplyr)
library(purrr)
df %>% 
    mutate(dates = as.Date(dates), ids = as.character(ids)) %>% 
    group_by(dates) %>%
    summarise(ids = list(unique(ids)), values = sum(values)) %>%
    merge(data.frame(dates = seq.Date(min(.$dates), max(.$dates), "day")), all.y = TRUE) %>% 
    transmute(
        dates,
        unique_ids = map_int(accumulate(ids, ~unique(c(.x, .y))), length),
        sum_values = accumulate(values, ~sum(.x, .y, na.rm = TRUE))
    )

The first part (mutate, group_by and then summarise) is easy to understand: after some preparation, we collect the unique ids and sum the values within each day. The result is:

# A tibble: 5 × 3
       dates       ids values
      <date>    <list>  <dbl>
1 2011-10-01 <chr [2]>     36
2 2011-10-02 <chr [1]>      2
3 2011-10-03 <chr [1]>      5
4 2011-10-05 <chr [1]>     10
5 2011-10-06 <chr [2]>      5

Then we merge this with data.frame(dates = seq.Date(min(.$dates), max(.$dates), "day")) to fill in possible date gaps.

Now we need to walk from top to bottom over the ids and values columns, accumulating as we go. For ids we first use purrr::accumulate with the function ~unique(c(.x, .y)). This means we start with the first cell of ids and concatenate it (c) step by step with the following cells, keeping only the unique values. In our case this evaluates to:

[[1]]
[1] "A" "B"

[[2]]
[1] "A" "B" "C"

[[3]]
[1] "A" "B" "C" "D"

[[4]]
[1] "A" "B" "C" "D"

[[5]]
[1] "A" "B" "C" "D" "E"

But we only need the number of distinct ids, so we map the length function over this list with the help of purrr::map_int.
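
On the accumulated list shown above, this yields the running count of distinct ids:

acc <- list(c("A", "B"), c("A", "B", "C"), c("A", "B", "C", "D"),
            c("A", "B", "C", "D"), c("A", "B", "C", "D", "E"))
purrr::map_int(acc, length)
# [1] 2 3 4 4 5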

For sum_values we calculate the cumulative sum (we can't use cumsum because we may have NAs after the merging).
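
To see why, here is a small comparison on the daily sums from the merged data (NA standing in for 2011-10-04):

library(purrr)
v <- c(36, 2, 5, NA, 10, 5)
cumsum(v)                                   # 36 38 43 NA NA NA, the NA propagates
accumulate(v, ~ sum(.x, .y, na.rm = TRUE))  # 36 38 43 43 53 58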

Iaroslav Domin