I have a data.frame
of groups and dates. How do I fill in all the missing dates in the min-max date range for each group?
Ideally I would do this in dplyr
. But ultimately, I'd just like to do this efficiently with as few lines of (readable) code as possible. Below is a minimal example. I actually have many dates and groups. Both of my approaches look kind of ugly. There has to be a better way, right?
#### setup ####
library(sqldf)
library(dplyr)
df <- data.frame(the_group = rep(LETTERS[1:2], each=3), date = Sys.Date() + c(0:2, 1:3), stringsAsFactors = F) %>%
tbl_df() %>%
slice(-2) # represents that I may be missing data in a range!
#### dplyr approach with cross join dummy ####
full_seq <- data.frame(cross_join_dummy = 1, date = seq.Date(from=min(df$date), to=max(df$date), by = "day"))
range_by_group <- df %>%
group_by(the_group) %>%
summarise(min_date = min(date), max_date = max(date)) %>%
ungroup() %>%
mutate(cross_join_dummy = 1)
desired <- range_by_group %>%
inner_join(full_seq, by="cross_join_dummy") %>%
filter(date >= min_date, date <= max_date) %>%
select(the_group, date)
#### sqldf approach ####
full_seq <- data.frame(date = as.character(seq.Date(from=min(df$date), to=max(df$date), by="day")))
df <- df %>%
mutate(date = as.character(date))
range_by_group <- sqldf("
SELECT the_group, MIN(date) AS min_date, MAX(date) AS max_date
FROM df
GROUP BY the_group
")
desired <- sqldf("
SELECT rbg.the_group, fs.date
FROM range_by_group rbg
JOIN full_seq fs
ON fs.date BETWEEN rbg.min_date AND rbg.max_date
")