I have a dataset with a large number of individuals, each exposed at an individual event, and I was provided with a control group with a similar sex distribution and similar dates of birth.

The observation period covers multiple years (e.g. 10 years), and I want to compare exposed individuals with non-exposed individuals matched on sex and on age at the time of exposure of the exposed individual.

A dataframe could look like this:

library(dplyr)  # provides tibble(), %>%, mutate() and if_else()

set.seed(1234)
df <- tibble(
  id = 1:10000,
  sex = sample(x = c("male", "female"), size = 10000, replace = TRUE),
  # sample() rescales prob, so the weights do not need to sum to 1
  exposed = sample(x = 0:1, 10000, replace = TRUE, prob = c(0.66, 0.33)),
  dead = sample(x = 0:1, 10000, replace = TRUE, prob = c(0.8, 0.2)),
  date_birth = sample(seq(as.Date("1930-01-01"), as.Date("1950-01-01"), by = "day"), size = 10000, replace = TRUE),
  date_death = sample(seq(as.Date("2010-01-01"), as.Date("2022-01-01"), by = "day"), size = 10000, replace = TRUE),
  date_exp = sample(seq(as.Date("2010-01-01"), as.Date("2022-01-01"), by = "day"), size = 10000, replace = TRUE)
)

# set date_exp to NA if not exposed;
# set date_death to NA unless the individual died and was not exposed
df <- df %>%
  mutate(
    date_exp = if_else(exposed == 1, date_exp, as.Date(NA)),
    date_death = if_else(dead == 1 & exposed == 0, date_death, as.Date(NA))
  )
 

Problem

To account for trends in other variables of interest over time, I need a control sample in which each exposed individual is paired with a non-exposed individual of the same sex and of similar age at the time of exposure. The age for the exposed group is of course date_exp - date_birth; however, we have no such reference point for the control group. Additionally, since the control group was selected by others on aggregate metrics (i.e. similar year of birth), some of the controls are deceased, and their observation period overlaps with that of only a fraction of the exposed group.
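To make the reference point concrete, here is a minimal sketch of the age-at-exposure calculation. It assumes the column names from the example data; `age_at()` and the three-row `toy` table are hypothetical and only for illustration, and dividing by 365.25 is an approximation of age in years.

```r
library(dplyr)

# Hypothetical helper: approximate age in years on a given date
age_at <- function(date, date_birth) {
  as.numeric(date - date_birth) / 365.25  # Date - Date gives a difference in days
}

# Tiny illustrative table: one exposed individual and two controls
toy <- tibble(
  id         = 1:3,
  exposed    = c(1, 0, 0),
  date_birth = as.Date(c("1940-06-01", "1940-01-01", "1941-01-01")),
  date_exp   = as.Date(c("2015-06-01", NA, NA))
)

# Age at exposure exists only for exposed individuals; controls get NA
toy <- toy %>% mutate(age_exp = age_at(date_exp, date_birth))
```

The `NA` values for the controls are exactly the missing reference point described above: a control only gets an age once it is assigned the exposure date of a specific exposed individual.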

Possible solution

I am having a hard time thinking of a feasible solution. My primary idea is to create a column for each day of the observation period (2010 - 2020) and calculate each individual's age on each day until that individual is removed from the database (death). Then, for each exposed individual, I would select the control individual of the same sex with the closest age on the day of exposure. While this should work in principle, I don't know how to incorporate it into a matching pipeline to ensure optimal pairing.
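A rough sketch of the idea, under some assumptions. On any fixed date, the age difference between two people equals the difference between their birth dates, so the per-day age columns may not be needed: matching on the closest date_birth among controls who are still alive on the exposure date should give the same pairs. The function `match_controls()` and the simulated `toy` data below are hypothetical, and the matching is greedy 1:1 without replacement, not an optimal pairing.

```r
library(dplyr)

# Small simulated data set with the same columns as the example above
set.seed(1)
n <- 300
days <- function(from, to) seq(as.Date(from), as.Date(to), by = "day")
toy <- tibble(
  id         = 1:n,
  sex        = sample(c("male", "female"), n, replace = TRUE),
  exposed    = rbinom(n, 1, 1 / 3),
  dead       = rbinom(n, 1, 0.2),
  date_birth = sample(days("1930-01-01", "1950-01-01"), n, replace = TRUE),
  date_death = sample(days("2010-01-01", "2022-01-01"), n, replace = TRUE),
  date_exp   = sample(days("2010-01-01", "2022-01-01"), n, replace = TRUE)
) %>%
  mutate(
    date_exp   = if_else(exposed == 1, date_exp, as.Date(NA)),
    date_death = if_else(dead == 1 & exposed == 0, date_death, as.Date(NA))
  )

# Hypothetical helper: greedy 1:1 matching without replacement
match_controls <- function(df) {
  cases    <- df %>% filter(exposed == 1) %>% arrange(date_exp)
  controls <- df %>% filter(exposed == 0)
  used     <- rep(FALSE, nrow(controls))
  match_id <- rep(NA_integer_, nrow(cases))

  for (i in seq_len(nrow(cases))) {
    # Eligible controls: unused, same sex, alive on the case's exposure date
    ok <- !used &
      controls$sex == cases$sex[i] &
      (is.na(controls$date_death) | controls$date_death > cases$date_exp[i])
    if (!any(ok)) next  # no eligible control left; the case stays unmatched

    # Birth-date gap in days equals the age gap on the exposure date
    gap <- abs(as.numeric(controls$date_birth - cases$date_birth[i]))
    gap[!ok] <- Inf
    j <- which.min(gap)
    used[j] <- TRUE
    match_id[i] <- controls$id[j]
  }

  cases %>% transmute(case_id = id, control_id = match_id, index_date = date_exp)
}

pairs <- match_controls(toy)
```

Each row of `pairs` assigns the case's exposure date to its control as `index_date`, which gives the control the reference point it was missing. Because the loop is greedy, the result depends on the order in which cases are processed (here by exposure date); an optimal pairing would need a dedicated optimal-matching routine with the same eligibility constraint.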

Thank you for your help!

raphael
  • This seems to be a survival-analysis problem with censored data. I'm not an expert on the subject, but the `survival` package has tools to deal with such data. – Ric Nov 11 '22 at 13:31
  • I am not trying to model anything at this point, but I will certainly look closely into the package, thank you! – raphael Nov 13 '22 at 22:22
