How to join two dataframes containing time varying variables in R

Question

This seems like a simple enough thing but I can't figure it out nor find an answer online - apologies if it something obvious. I have two seperate dataframes containing the same patients with the same unique identifier. Both datasets have time varying variables - one continuous and one categorical and the time to each reading is different in the sets but have a common start point at time 1. I have tried to modify the tmerge function from survival package but without luck as I don't have a dichotomous outcome variable nor a single data set with one row per patient.

Reprex for creating the datasets below (df1 and df2) and an example of my desired combined output table for a single patient (ID 3), output gets very long if done for all 4 patients

Thanks for any possible help

df1 <- structure(list(tstart = c(1, 1, 1, 1426, 1, 560, 567), tstop = c(2049, 
3426, 1426, 1707, 560, 567, 4207), category = structure(c(1L, 
1L, 1L, 2L, 1L, 4L, 2L), .Label = c("none", "high", "low", "moderate"
), class = "factor"), id = c(1L, 2L, 3L, 3L, 4L, 4L, 4L)), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))


df2 <- structure(list(tstart = c(1, 365, 730, 1, 365, 730, 1096, 2557, 
1, 365, 730, 1096, 1826, 2557, 3652, 1), tstop = c(365, 730, 
1096, 365, 730, 1096, 2557, 2582, 365, 730, 1096, 1826, 2557, 
3652, 4864, 365), egfr = c(66, 62, 58, 54, 50, 43, 49, 51, 106, 
103, 80, 92, 97, 90, 81, 51), id = c(1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L)), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))

df_example_patient_3 <- structure(list(id = c(3L, 3L, 3L,
3L, 3L, 3L,3L, 3L, 3L), tstart = c(1, 365, 730, 1096, 1426, 1707, 1826, 2557, 3652), tstop = c(365, 730, 
1096, 1426, 1707, 1826, 2557, 3652, 4864), egfr = c(106, 103, 80, 92, 92, 92, 97, 90, 81), category = c("none", "none", "none", "none", "high", "high", "high", "high", "high")), row.names = c(NA, -9L), class = c("tbl_df", 
"tbl", "data.frame"))

# DF1
  tstart tstop category    id
   <dbl> <dbl> <fct>    <int>
1      1  2049 none         1
2      1  3426 none         2
3      1  1426 none         3
4   1426  1707 high         3
5      1   560 none         4
6    560   567 moderate     4
7    567  4207 high         4

# DF2
   tstart tstop  egfr    id
    <dbl> <dbl> <dbl> <int>
 1      1   365    66     1
 2    365   730    62     1
 3    730  1096    58     1
 4      1   365    54     2
 5    365   730    50     2
 6    730  1096    43     2
 7   1096  2557    49     2
 8   2557  2582    51     2
 9      1   365   106     3
10    365   730   103     3
11    730  1096    80     3
12   1096  1826    92     3
13   1826  2557    97     3
14   2557  3652    90     3
15   3652  4864    81     3
16      1   365    51     4

# Combined set
     id tstart tstop  egfr category
  <int>  <dbl> <dbl> <dbl> <chr>   
1     3      1   365   106 none    
2     3    365   730   103 none    
3     3    730  1096    80 none    
4     3   1096  1426    92 none    
5     3   1426  1707    92 high    
6     3   1707  1826    92 high    
7     3   1826  2557    97 high    
8     3   2557  3652    90 high    
9     3   3652  4864    81 high

@benson23 this is much more complicated than an equality join. Non-equi joins from `fuzzyjoin` or `data.table` get close, but would require some pre- or post-processing for when the start and stop times don't line up between the data sets. This is probably the way to go. — Gregor Thomas, Feb 26 '22 at 06:53
My apologies, when I'm checking OP's data, I thought all the elements from `DF1` match with `DF2`, I definitely overlooked `tstart` and `tstop` — benson23, Feb 26 '22 at 07:03
Please provide the exact rules to go from the input to the output. — G. Grothendieck, Feb 26 '22 at 17:44

L X · Answer 1 · 2022-02-26T17:28:30.723

I had to do it this way to really work out the details. First, i construct a full df1 with all the timestamps, including those of df2.

then i proceed with multiple merges. This is not elegant, but it works:

library(data.table)
library(zoo)

# Proper data.tables
setDT(df1, key = c("id", "tstart"))
setDT(df2, key = c("id", "tstart"))

timestamps_by_id <- unique(rbind(
  df1[, .(id, tstart)],
  df1[, .(id, tstop)],
  df2[, .(id, tstart)],
  df2[, .(id, tstop)],
  use.names = F
))
setorder(timestamps_by_id, id, tstart)

# Merge to construct full df1
df1_full <- df1[timestamps_by_id]
df1_full[, category := na.locf(category), by = id]
df1_full[, tstop := shift(tstart, -1), by = id]
setkey(df1_full, id, tstart)

# Merge with df2
result <- na.omit(df2[df1_full, roll = T])
result[, tstop := i.tstop]
print(result[id == 3, .(id, tstart, tstop, egfr, category)])

Or a more data.tabley solution using the more arcane foverlaps:

library(data.table)

# Proper data.tables
setDT(df1, key = c("id", "tstart", "tstop"))
setDT(df2, key = c("id", "tstart", "tstop"))

# We add an infinite upper range
proper_df1 <- rbind(
  df1,
  df1[, .SD[which.max(tstop)], by = .(id)][, .(id, tstart = tstop, tstop = Inf, category), ]
)

setkey(proper_df1, id, tstart, tstop)

overlaps <- foverlaps(df2, proper_df1, type = "any") # Overlap join

overlaps[
tstart %between% .(i.tstart, i.tstop) & tstart != 1, 
i.tstart := tstart
]
overlaps[tstop %between% .(i.tstart, i.tstop), i.tstop := tstop]

print(overlaps[
  id == 3,
  .(id, "tstart" = i.tstart, "tstop" = i.tstop, category, egfr)
])

Thank you LX! You solution seems to work for me (although don't understand the code very well as haven't used those packages before), I did an "all.equal" function between your output and mine and were the same (apart from the format of some of the variables) but I suspect your code will be more robust to different datasets, I think mine will need to be altered depending on particular layout of the data, thanks again! — Ryan, Feb 27 '22 at 05:08

Ryan · Answer 2 · 2022-02-27T05:04:43.997

This messy dplyr solution seems to work for this particular dataset but don't know would it work for all datasets, the direction of the fill may need to be altered depending on particular dataset

library(tidyverse)
library(magrittr)



df1 %>%
  bind_rows(df2) %>% 
  group_by(id) %>% 
  arrange(id, tstop) %>% 
  mutate(
    tstart = case_when(
      tstart < lag(tstop) ~ lag(tstop), TRUE ~ tstart)) %>% 
  fill(egfr, category, .direction = "updown") %>% 
  ungroup() %>% 
  filter(id == 3)

  tstart tstop category    id  egfr
   <dbl> <dbl> <fct>    <int> <dbl>
1      1   365 none         3   106
2    365   730 none         3   103
3    730  1096 none         3    80
4   1096  1426 none         3    92
5   1426  1707 high         3    92
6   1707  1826 high         3    92
7   1826  2557 high         3    97
8   2557  3652 high         3    90
9   3652  4864 high         3    81

How to join two dataframes containing time varying variables in R

2 Answers2