R (dplyr): find all rows in row-specific range WITH RESTRICTION

Question

I have a dataset where each row is identified by a hospitalization id. Each row contains information on the hospitalization id, the dates of admission and discharge, as well as the identification of the hospital where it took place and the physician responsible for it.

I would like to know, for each hospitalization, the id of all other hospitalizations concluded in the 30 days before the beginning of the given hospitalization that were performed by other physicians in the same hospital.

Below is a simple example of 8 hospitalizations performed by 2 physicians in 2 hospitals (physicians may work in more than one hospital).

library("tidyverse")
df <- data.frame(hospitalization_id = c(1, 2, 3, 
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

Using the solution posted in R (dplyr): find all rows in row-specific range: first, I find all other hospitalizations during the 30-day period prior to the start of the given hospitalization in the given hospital; then, I drop those hospitalizations that were performed by the same physician.

df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1)

df_semifinal <- df_with_date_range %>%
  rowwise() %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id),
                                     function(x, y, z) ungroup(filter(rowwise(df_with_date_range),
                                                                      between(date_end, x, y),
                                                                      hospital_id == z))$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

df_final <- df_semifinal %>% 
  left_join(select(df, hospitalization_id, physician_id),
            by = c('hospital_id_in_range' = 'hospitalization_id')) %>%
  mutate(hospital_id_in_range = ifelse(physician_id.x == physician_id.y, NA, hospital_id_in_range)) %>%
  select(-physician_id.y) %>%
  rename(physician_id = physician_id.x) %>%
  distinct()

I am trying to write a more efficient code given that my data is massive - ideally I would like to avoid just adding all hospitalizations and then dropping the ones performed by the given physician.

Do you want a `dplyr` solution or are other packages also ok? Usually, `data.table` is better suited for large data sets. — starja, Aug 15 '22 at 19:41
Thank you @starja - it would be great if you could suggest a code with `data.table` as well! The only issue is that the rest of my code uses `dplyr` - is it ok to mix them throughout the code or is this not recommended? — PaulaSpinola, Aug 15 '22 at 22:44
Also, with my solution, I realized that you can eliminate the first `rowwise` as well with no issue, which should also significantly speed up the process. You can try my edits if you want to. — dcsuka, Aug 16 '22 at 03:03

dcsuka · Accepted Answer · 2022-08-16T03:02:19.687

What about just adding another filtering criteria for physician in my previous solution as such:

df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                   function(x, y, z, p) filter(df_with_date_range,
                                                 date_start >= x & date_start <= y,
                                                 hospital_id == z,
                                                 physician_id != p)$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

# # A tibble: 9 × 8
#   hospitalization_id hospital_id physician_id date_start date_end   date_range1 date_range2 hospital_id_in_range
#                <dbl> <chr>              <dbl> <date>     <date>     <date>      <date>                     <dbl>
# 1                  1 A                      1 2000-01-01 2000-01-03 1999-12-01  1999-12-31                    NA
# 2                  2 A                      1 2000-01-12 2000-01-18 1999-12-12  2000-01-11                    NA
# 3                  3 A                      1 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     4
# 4                  4 A                      2 2000-01-12 2000-01-18 1999-12-12  2000-01-11                     1
# 5                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     1
# 6                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     2
# 7                  6 B                      2 2000-02-10 2000-02-11 2000-01-10  2000-02-09                    NA
# 8                  7 B                      2 2000-02-11 2000-02-14 2000-01-11  2000-02-10                    NA
# 9                  8 B                      2 2000-02-12 2000-02-17 2000-01-12  2000-02-11                    NA

Also, I switched the between to a better vectorized implementation that does not require a second rowwise. Using more rowwise tends to slow the procedure down. This one should be a bit faster.

The expected output might be a bit different than yours because I filter during the procedure instead of replacing with NAs afterwards. If you want to replace physician copies with NAs, let me know and I can code a mutate of the filtered data frame to do that.

there seems to be no `rowwise` in your code any longer. You said you eliminated the second `rowwise` but I also don't see it anywhere in your code. Just checking :) Thank you! — PaulaSpinola, Aug 16 '22 at 07:37
Yes, I got rid of the `rowwise` as it was actually unnecessary, this is why you do not see it after my edits. The `rowwise` was not needed because of the `pmap` and the code should run much faster this way. Let me know if you have any further questions! — dcsuka, Aug 16 '22 at 15:41
Thank you!! Now I am trying to sort out another complication from my data: the same hospitalization may involve more than one physician. I would only like to add the other hospitalizations that do not involve the given physician. E.g. for instance, if I am a physician who have worked in the past 30 days with someone else in another hospitalization, I do not want this hospitalization to be added to the dataset. I would like only hospitalizations that did not involve me to be added to the dataset. — PaulaSpinola, Aug 16 '22 at 16:08
I have updated the example and managed to write a code where I find (i) a list with all the hospitalizations in the past 30 days in the hospital, (ii) a list of hospitalizations performed in the past 30 days by the given physician. Then, I take the elements in list (i) which are not in (ii). The only issue is that a same hospitalization shows up multiple times in list (i) in case it involves multiple physicians. Could I share my code with you? If so, where should I post it: here or do I create a new thread? — PaulaSpinola, Aug 16 '22 at 16:26
I'd be happy to take a look. Just make a new thread, as it is a different question. Make sure to highlight where the current code goes wrong, and what the expected output is. — dcsuka, Aug 16 '22 at 17:59
Thank you @dcsuka! I have just posted it here: https://stackoverflow.com/questions/73379338/r-dplyr-find-set-of-rows-in-row-specific-range-with-restriction-at-different — PaulaSpinola, Aug 16 '22 at 19:26
I added the last layer of complexity to this exercise - this is what I really need to do with my real data! Thanks so much for your help! — PaulaSpinola, Aug 16 '22 at 19:27

R (dplyr): find all rows in row-specific range WITH RESTRICTION

1 Answers1

Linked

R (dplyr): find all rows in row-specific range *WITH RESTRICTION*

1 Answers1

Linked

R (dplyr): find all rows in row-specific range WITH RESTRICTION