do() superseded! Alternative is to use across(), nest_by(), and summarise, how?

Question

I'm doing something quite simple. Given a dataframe of start dates and end dates for specific periods I want to expand/create a full sequence for each period binned by week (with the factor for each row), then output this in a single large dataframe.

For instance:

library(tidyverse)
library(lubridate)

# Dataset
  start_dates = ymd_hms(c("2019-05-08 00:00:00",
                          "2020-01-17 00:00:00",
                          "2020-03-03 00:00:00",
                          "2020-05-28 00:00:00",
                          "2020-12-10 00:00:00",
                          "2021-05-07 00:00:00",
                          "2022-01-04 00:00:00"), tz = "UTC")
  
  end_dates = ymd_hms(c( "2019-10-24 00:00:00",
                         "2020-03-03 00:00:00", 
                         "2020-05-28 00:00:00",
                         "2020-12-10 00:00:00",
                         "2021-05-07 00:00:00",
                         "2022-01-04 00:00:00",
                         "2022-01-19 00:00:00"), tz = "UTC") 
  
  df1 = data.frame(studying = paste0("period",seq(1:7),sep = ""),start_dates,end_dates)

It was suggested to me to use do(), which currently works fine but I hate it when things are superseded. I also have a way of doing it using map2. But reading the file (https://dplyr.tidyverse.org/reference/do.html) suggests you can use nest_by(), across() and summarise() to do the same job as do(), how would I go about getting same result? I've tried a lot of things but I just can't seem to get it.

# do() way to do it
df1 %>% 
  group_by(studying) %>% 
  do(data.frame(week=seq(.$start_dates,.$end_dates,by="1 week")))

# transmute() way to do it
 df1 %>% 
  transmute(weeks = map2(start_dates,end_dates, seq, by = "1 week"), studying) 
 %>% unnest(cols = c(weeks))

`across` is going to iterate across columns, but you want to operate on both date columns simultaneously, so I don't think it is appropriate. `nest_by(x, y)` is effectively the same as `group_by(x, y) %>% nest()`, so it also doesn't really do what you need, since you'll still need to iterate over each group to come up with the sequence. `summarize` can be used in place of `transmute`. In the end, your `transmute` code (and my deleted answer, posted before reading the rest of your question) is the way to go. — r2evans, Jan 19 '22 at 16:21
What about the `transmute` version isn't what you want? I think the nest / across / summarise suggestion is just that, a suggestion, based on the assumption that `do` was often used for more complicated summarizations or operations to be done on entire data frames — camille, Jan 19 '22 at 16:36
Nothing specific, It's just I was interested in knowing alternatives. As they specifically mention nest_by() I thought it could be useful to know. — Dasr, Jan 19 '22 at 16:38

score 3 · Answer 1 · answered Jan 19 '22 at 17:11

You can also use tidyr::complete:

df1 %>% 
  group_by(studying) %>% 
  complete(start_dates = seq(from = start_dates, to = end_dates, by = "1 week")) %>% 
  select(-end_dates, weeks = start_dates)

# A tibble: 134 x 2
# Groups:   studying [7]
   studying weeks              
   <chr>    <dttm>             
 1 period1  2019-05-08 00:00:00
 2 period1  2019-05-15 00:00:00
 3 period1  2019-05-22 00:00:00
 4 period1  2019-05-29 00:00:00
 5 period1  2019-06-05 00:00:00
 6 period1  2019-06-12 00:00:00
 7 period1  2019-06-19 00:00:00
 8 period1  2019-06-26 00:00:00
 9 period1  2019-07-03 00:00:00
10 period1  2019-07-10 00:00:00
# ... with 124 more rows

TimTeaFan · Accepted Answer · 2022-01-20T20:08:55.423

As the documentation of ?do suggests, we can now use summarise and replace the . with across():

library(tidyverse)
library(lubridate)

df1 %>% 
  group_by(studying) %>% 
  summarise(week = seq(across()$start_dates,
                       across()$end_dates,
                       by = "1 week"))
#> `summarise()` has grouped output by 'studying'. You can override using the
#> `.groups` argument.
#> # A tibble: 134 x 2
#> # Groups:   studying [7]
#>    studying week               
#>    <chr>    <dttm>             
#>  1 period1  2019-05-08 00:00:00
#>  2 period1  2019-05-15 00:00:00
#>  3 period1  2019-05-22 00:00:00
#>  4 period1  2019-05-29 00:00:00
#>  5 period1  2019-06-05 00:00:00
#>  6 period1  2019-06-12 00:00:00
#>  7 period1  2019-06-19 00:00:00
#>  8 period1  2019-06-26 00:00:00
#>  9 period1  2019-07-03 00:00:00
#> 10 period1  2019-07-10 00:00:00
#> # … with 124 more rows

^{Created on 2022-01-19 by the reprex package (v0.3.0)}

Thanks everyone for the input. Some really interesting approaches and all great. I chose this as top answer as I guess it might be more in the "across" spirit. What is the `data.frame()` for out of interest. Does it not produce the same without? — Dasr, Jan 20 '22 at 18:52
@Dasr: True, we don't need the `data.frame` call. Nice spot. I just took your code and replaced `do` with `summarise` and the dot `.` with `across` without realizing that we don't need to wrap this in `data.frame` anymore :) — TimTeaFan, Jan 20 '22 at 20:10

G. Grothendieck · Answer 3 · 2022-01-19T23:33:09.953

Although marked Experimental the help file for group_modify does say that

‘group_modify()’ is an evolution of ‘do()’

and, in fact, the code for the example in the question using group_modify is nearly the same as with do.

# with group_modify
df2 <- df1 %>% 
  group_by(studying) %>% 
  group_modify(~ data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))

# with do
df0 <- df1 %>% 
  group_by(studying) %>% 
  do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))

identical(df2, df0)
## [1] TRUE

score 2 · Answer 4 · answered Jan 19 '22 at 16:23

2

Not sure if this exactly what you are looking for, but here is my attempt with rowwise and unnest

df1 %>% 
  rowwise() %>% 
  mutate(week = list(seq(start_dates, end_dates, by = "1 week"))) %>% 
  select(studying, week) %>% 
  unnest(cols = c(week))

answered Jan 19 '22 at 16:23

biobier

40
5

Cool approach, I'll leave it up a little while and see what other lines people are taking :) – Dasr Jan 19 '22 at 16:25

score 0 · Answer 5 · answered Jan 20 '22 at 06:15

Another approach:

library(tidyverse)

df1 %>%
    group_by(studying) %>%
    summarise(df = tibble(weeks = seq(start_dates, end_dates, by = 'week'))) %>%
    unnest(df)
#> `summarise()` has grouped output by 'studying'. You can override using the `.groups` argument.
#> # A tibble: 134 × 2
#> # Groups:   studying [7]
#>    studying weeks              
#>    <chr>    <dttm>             
#>  1 period1  2019-05-08 00:00:00
#>  2 period1  2019-05-15 00:00:00
#>  3 period1  2019-05-22 00:00:00
#>  4 period1  2019-05-29 00:00:00
#>  5 period1  2019-06-05 00:00:00
#>  6 period1  2019-06-12 00:00:00
#>  7 period1  2019-06-19 00:00:00
#>  8 period1  2019-06-26 00:00:00
#>  9 period1  2019-07-03 00:00:00
#> 10 period1  2019-07-10 00:00:00
#> # … with 124 more rows

^{Created on 2022-01-20 by the reprex package (v2.0.1)}

do() superseded! Alternative is to use across(), nest_by(), and summarise, how?

5 Answers5