I'm working with data describing events having start and end time. For instance, it could be in a form of:
I'd like to convert this data to a form where I could count occurrences of events in regular intervals in order to be able to answer question of how many events were taking place from 13:15:00
to 13:29:59
and 13:30:00
to 13:45:00
and so on. In the example above the second record could be counted in first slot simply using start time, but wouldn't be counted as 'ongoing' in the second regular interval, even when end time was integrated.
To work with 15 min intervals I came up with a clumsy solution that uses tidyr::uncount
to 'expand' dataset to 24*4=96
15 min intervals through the 24h period and then filtering ones that are inside specified intervals.
library(tidyverse)
library(lubridate)
library(magrittr)
df1 <- tibble::tibble(
id = c(1, 2),
start_date = c(ymd_hms("2018-12-10 14:45:51", tz = "Australia/Brisbane"),
ymd_hms("2018-12-10 13:29:37", tz = "Australia/Brisbane")),
end_date = c(ymd_hms("2018-12-10 14:59:04", tz = "Australia/Brisbane"),
ymd_hms("2018-12-10 14:02:37", tz = "Australia/Brisbane")))
df2 <- df1 %>%
mutate(episode = 96) %>%
tidyr::uncount(episode, .id = "sequence")
df2$int_start <- rep(
seq(ymd_hms("2018-12-10 00:00:00", tz = "Australia/Brisbane"),
ymd_hms("2018-12-10 23:59:59", tz = "Australia/Brisbane"),
by = "15 mins"),
2)
df2$int_end <- df2$int_start + 899
df2 %<>%
filter(int_end > start_date & int_start < end_date )
That gives me data in desired format:
But I have a feeling there must be smarter way to do that. Expansion for instance will not be great solution for very large dataset and/or when the time interval is small (I think). It'll probably be also a pain to extend it to work with multiple days (I think).
In Stata one could also use stsplit
command to accomplish something similar. I tried tinkering with survSplit
from survival
package but ended up with lots of records as well:
df1$status <- 1
df1$start_date <- as.numeric(df1$start_date)
df1$end_date <- as.numeric(df1$end_date)
df3 <- survSplit(Surv(end_date, status) ~., df1,
cut=seq(from=as.numeric(as.POSIXct("2018-12-10 00:00:00")),
to=as.numeric(as.POSIXct("2018-12-10 00:00:00")) + 24*60*60,
by=900),
start = "start_int",
id="new_id",
episode ="episode")
df3$start_int <- as.POSIXct(df3$start_int, origin = "1970-01-01", tz = "Australia/Brisbane")
df3$start_date <- as.POSIXct(df3$start_date, origin = "1970-01-01", tz = "Australia/Brisbane")
df3$end_date <- as.POSIXct(df3$end_date, origin = "1970-01-01", tz = "Australia/Brisbane")
Any pointers to a better way to achieve such task?