-1

Sorry if this is a duplicate but could not quite find what was right for me.

I have a Dataframe 1 - where Episode_ID is unique but patient_ID is not

  Episode_ID  Patient_ID  Drug   Commenced_date  
1 1             0           A      2018-01-02    
2 2             0           B      2020-11-26    
3 3             1           B      2017-09-26    
4 4             2           B      2020-11-26  

and Dataframe 2 where Visit_ID is unique but Patient_ID is not

  Patient_ID  Visit_ID   Visit_date  Visit_Score  
1  0           1         2018-01-22   2.3  
2  0           2         2019-06-01   1.5  
3  0           3         2020-11-26   1.6 
4  1           B         2020-11-26   1.4

What I want to achieve is for each episode (from df1) to be matched to that patients closest visit in (df2) by looping through all the visit_dates (matched by patient_ID) in df 2 and have the associated visit_score appended. As each row in Df1 is iterated, the patient ID can be matched to patient ID in df2. So the first row, should be a 1 to 3 match. datediff with which.min can then be applied to find the closest match. Its the coding syntax that I really can't get down right.

For example, based on the above. The result would be:

  Episode_ID  Patient_ID  Drug   Commenced_date  Closest_Visit Visit_score
1 1             0           A      2018-01-02    2018-01-22    2.3
2 2             0           B      2020-11-26    2020-11-26    1.6
3 3             1           B      2017-09-26    2018-01-22    2.3 
4 4             2           B      2020-11-26    2020-11-26    1.4

This should be quite simple but I've becoming really rusty in R and out of practice. I tried to use a nested for loop but it kept breaking. Then tried using sapply but couldn't quite get it to work.

Any help would be greatly appreciated. Happy for a solution in python as well but prefer R for now given the rest of the project at the moment.

  • 1
    This kind of task is more easily accomplished by joining your tables (by Patient_ID), then filtering to the closest date within each Patient_ID). See this SO question and answers: [Match two tables based on a time difference criterium](https://stackoverflow.com/questions/71033424/match-two-tables-based-on-a-time-difference-criterium) – zephryl Feb 13 '22 at 12:56

1 Answers1

1

There are a few options to consider for approaches with R that do not require loops.

I might consider data.table and join the data.frames with roll:

library(data.table)

setDT(df1)
setDT(df2)

df2[df1, on = c('Patient_ID', 'Visit_date' = 'Commenced_date'), roll = 'nearest']

Output

   Patient_ID Visit_ID Visit_date Visit_Score Episode_ID Drug
1:          0        1 2018-01-02         2.3          1    A
2:          0        3 2020-11-26         1.6          2    B
3:          1        B 2017-09-26         1.4          3    B
4:          2     <NA> 2020-11-26          NA          4    B

Or use sqldf with a join of the two data.frames:

library(sqldf)

sqldf("select df1.*, df2.*, min(abs(df1.Commenced_date - df2.Visit_date)) diff
       from df1
       left join df2 using(Patient_ID)
       group by df1.rowid") 

Output

  Episode_ID Patient_ID Drug Commenced_date Patient_ID Visit_ID Visit_date Visit_Score diff
1          1          0    A     2018-01-02          0        1 2018-01-22         2.3   20
2          2          0    B     2020-11-26          0        3 2020-11-26         1.6    0
3          3          1    B     2017-09-26          1        B 2020-11-26         1.4 1157
4          4          2    B     2020-11-26         NA     <NA>       <NA>          NA   NA

In this case, leaving in a calculated difference in dates.


If you did want a tidyverse approach, you could a join and then filter or slice to keep the row with minimum difference in dates:

library(tidyverse)

left_join(df1, df2, by = "Patient_ID") %>%
  group_by(Patient_ID, Episode_ID) %>%
  slice_min(abs(Visit_date - Commenced_date))

Output

  Episode_ID Patient_ID Drug  Commenced_date Visit_ID Visit_date Visit_Score
       <int>      <int> <chr> <date>         <chr>    <date>           <dbl>
1          1          0 A     2018-01-02     1        2018-01-22         2.3
2          2          0 B     2020-11-26     3        2020-11-26         1.6
3          3          1 B     2017-09-26     B        2020-11-26         1.4

In this case, used slice_min which will omit rows in df1 that did not match by Patient_ID on df2.

Ben
  • 28,684
  • 5
  • 23
  • 45