How to fill in missing dates in range by group

Question

I have a data.frame of groups and dates. How do I fill in all the missing dates in the min-max date range for each group?

Ideally I would do this in dplyr. But ultimately, I'd just like to do this efficiently with as few lines of (readable) code as possible. Below is a minimal example. I actually have many dates and groups. Both of my approaches look kind of ugly. There has to be a better way, right?

#### setup ####

library(sqldf)
library(dplyr)
df <- data.frame(the_group = rep(LETTERS[1:2], each=3), date = Sys.Date() + c(0:2, 1:3), stringsAsFactors = F) %>%
  tbl_df() %>%
  slice(-2) # represents that I may be missing data in a range!

#### dplyr approach with cross join dummy ####
full_seq <- data.frame(cross_join_dummy = 1, date = seq.Date(from=min(df$date), to=max(df$date), by = "day"))

range_by_group <- df %>%
  group_by(the_group) %>%
  summarise(min_date = min(date), max_date = max(date)) %>%
  ungroup() %>%
  mutate(cross_join_dummy = 1)

desired <- range_by_group %>%
  inner_join(full_seq, by="cross_join_dummy") %>%
  filter(date >= min_date, date <= max_date) %>%
  select(the_group, date)

#### sqldf approach ####
full_seq <- data.frame(date = as.character(seq.Date(from=min(df$date), to=max(df$date), by="day")))

df <- df %>%
  mutate(date = as.character(date))

range_by_group <- sqldf("
                  SELECT the_group, MIN(date) AS min_date, MAX(date) AS max_date
                  FROM df
                  GROUP BY the_group
                  ")

desired <- sqldf("
            SELECT rbg.the_group, fs.date
            FROM range_by_group rbg
            JOIN full_seq fs
              ON fs.date BETWEEN rbg.min_date AND rbg.max_date
            ")

G. Grothendieck · Accepted Answer · 2016-10-27T19:18:35.843

1) no packages - by

This does not use any packages. by splits df by df$the_group and then performs the indicated operation on each one. do.call("rbind", ...) puts the groups back together.

seq_date <- function(x) seq(min(x), max(x), by = "day")
do.call("rbind", by(df, df$the_group, with, 
  data.frame(the_group = the_group[1], date = seq_date(date))))

2) data.table Here is a solution using data.table. seq_date is from (1)

library(data.table)

dt <- as.data.table(df)
dt[, list(date = seq_date(date)), by = the_group]

3) tidyverse This uses map_df from purrr to apply the function given in formula notation over the groups and to put the result together into a data frame. data_frame is from the tibble package. seq_date is from (1).

library(tidyverse)

df %>% 
   split(.$the_group) %>% 
   map_df(~ data_frame(the_group = .$the_group[1], date = seq_date(.$date)))

4) tapply

4a) tapply - tidyr/reshape2 seq_date is from (1).

library(tidyr)
library(reshape2)

df %>%
   { tapply(.$date, .$the_group, seq_date, simplify = FALSE) } %>%
   melt %>%
   unnest

4b) tapply - no packages The last line pieces together the output of tapply avoiding the need for any packages. seq_date is from (1).

ta <- tapply(df$date, df$the_group, seq_date, simplify = FALSE)
data.frame(the_group = rep(names(ta), lengths(ta)), date = do.call("c", ta))

4c) tapply - lattice We can use the lattice package's make.groups on ta from (4b). lattice is pre-installed with R so it does not involve installing any additional packages. Unfortunately make.groups removes the Date class attribute so we have to put it back. Also make.groups uses which and data column names so we fix up the column names.

library(lattice)
with(do.call("make.groups", ta), 
  data.frame(the_group = which, date = structure(data, class = "Date")))

4d) tapply - no packages - stack We can use stack to convert ta from (4b) to the desired form provided we remove the "Date" class first. Then after applying stack we can restore the "Date" class. stack uses hard coded column names which we replace using setNames.

stack_dates <- function(x) 
  transform(stack(lapply(x, as.vector)), values = structure(values, class = "Date"))
setNames(stack_dates(ta)[2:1], c("the_group", "date"))

Still having trouble understanding why the "no packages" approach works. Especially with the `the_group[1]` and `with` — lowndrul, Oct 25 '16 at 00:08
It's the same as `do.call("rbind", by(df, df$the_group, function(x) data.frame(the_group = x$the_group[1], date = seq(min(x$date), max(x$date), by = "day"))))` except we used `with` to shorten it. `the_group` is constant within a group so we just used the first component as you can't mix two different length vectors in a data frame unless one of them has length 1. — G. Grothendieck, Oct 25 '16 at 00:13
A slightly expanded answer in light of improvements today (OCT 2017) would use `nest()` functionality instead of the split, and map combination the OP uses here for his `tidyverse` example. Slightly cleaner presentation. Otherwise, the logic here remains the same. — jacobsg, Oct 26 '17 at 15:56

How to fill in missing dates in range by group

1 Answers1