I have a dataset where each row is identified by a hospitalization id. Each row contains information on the hospitalization id, the dates of admission and discharge, as well as the identification of the hospital where it took place and the physician responsible for it.
I would like to know, for each hospitalization, the id of all other hospitalizations concluded in the 30 days before the beginning of the given hospitalization that were performed by other physicians in the same hospital.
Below is a simple example of 8 hospitalizations performed by 2 physicians in 2 hospitals (physicians may work in more than one hospital).
library("tidyverse")
df <- data.frame(hospitalization_id = c(1, 2, 3,
4, 5,
6, 7, 8),
hospital_id = c("A", "A", "A",
"A", "A",
"B", "B", "B"),
physician_id = c(1, 1, 1,
2, 2,
2, 2, 2),
date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
"2000-01-12", "2000-01-20",
"2000-02-10", "2000-02-11", "2000-02-12")),
date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
"2000-01-18", "2000-01-22",
"2000-02-11", "2000-02-14", "2000-02-17")))
Using the solution posted in R (dplyr): find all rows in row-specific range: first, I find all other hospitalizations during the 30-day period prior to the start of the given hospitalization in the given hospital; then, I drop those hospitalizations that were performed by the same physician.
df_with_date_range <- df %>%
mutate(date_range1 = date_start - 31,
date_range2 = date_start - 1)
df_semifinal <- df_with_date_range %>%
rowwise() %>%
mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id),
function(x, y, z) ungroup(filter(rowwise(df_with_date_range),
between(date_end, x, y),
hospital_id == z))$hospitalization_id)) %>%
unnest(hospital_id_in_range, keep_empty = TRUE)
df_final <- df_semifinal %>%
left_join(select(df, hospitalization_id, physician_id),
by = c('hospital_id_in_range' = 'hospitalization_id')) %>%
mutate(hospital_id_in_range = ifelse(physician_id.x == physician_id.y, NA, hospital_id_in_range)) %>%
select(-physician_id.y) %>%
rename(physician_id = physician_id.x) %>%
distinct()
I am trying to write a more efficient code given that my data is massive - ideally I would like to avoid just adding all hospitalizations and then dropping the ones performed by the given physician.