R coding in Databricks.
I want the time intervals in hours between date1 entries, arranged by pid, med and date1.
I want the latest date1 entry in a sequence of events per date to be manually adjustable to 24 hours.
A cohort is when pid, med and date1 is the same.
Any change will terminate the last cohort with an hour_output == 24.
df
pid med date1
1 1 drugA 2023-02-02 09:00:00
2 1 drugA 2023-02-02 12:00:00
3 1 drugA 2023-02-02 14:00:00
4 1 drugB 2023-02-03 10:00:00
5 1 drugB 2023-02-03 18:00:00
What script was attempted.
df1 <- df %>%
arrange(pid, med, date1) %>%
mutate(hours_output = as.numeric(difftime(lead(date1), date1, units = "hours")))
# Replace the last duration value with 24 hours
df1$hours_output[last(nrow(df1))] <- 24
df1 <- df1 %>% select(med, date1, hours_output)
head(df1)
Actual output
pid med date1 hours_output
1 1 drugA 2023-02-02 09:00:00 3.00
2 1 drugA 2023-02-02 12:00:00 2.00
3 1 drugA 2023-02-02 14:00:00 20.00
4 1 drugB 2023-02-03 10:00:00 8.00
5 1 drugB 2023-02-03 18:00:00 18.00 (18 hours to the next row - not shown)
Desired output
pid med date1 hours_output
1 1 drugA 2023-02-02 09:00:00 3.00
2 1 drugA 2023-02-02 12:00:00 2.00
3 1 drugA 2023-02-02 14:00:00 24.00
4 1 drugB 2023-02-03 10:00:00 8.00
5 1 drugB 2023-02-03 18:00:00 24.00