
I have a dataset like the one below. date_e is accurate for status = "1". I want to simulate date_e based on age, so a new column, new_date_e, should hold simulated values for status = "0" and keep the original date_e for status = "1". Also, status = 1 carries a higher risk, so the difference date_e - age should on average be shorter for status = "1" than for status = "0".

           age      date_e  status  id
1   1950-10-21 2008-11-02      0   1
2   1941-02-11 2006-08-28      0   2
3   1940-01-20 2000-05-25      0   3
4   1957-11-05 2008-03-28      1   4
5   1946-09-15 2004-03-10      0   5

The data can be reproduced with the code below (the date_e column is called date1 here):

library(dplyr)

set.seed(1)

age <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by="day"), 1000)
date1 <- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by="day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df <- data.frame(age, date1, status)
df <- df %>% mutate(id = row_number())
Rui Barradas
Ali Roghani
    Can you post the expected output? Also, when simulating data, please include a call to `?set.seed` before calling pseudo-RNG functions. – Rui Barradas Mar 15 '22 at 05:14
    @Ali Roghani, I am sorry but the question is totally unclear. What does it mean " new_date_e will be changed for status="0", will be same for status="1"" ? Changed how? What do you mean "higher risk"? If you need help you must absolutely give more details and much more explanations. – Fabrizio Mar 15 '22 at 08:13
  • Hello, I mean that the new column, new_date_e, is based on date_e: it keeps the same date as date_e for status="1", while simulated dates are used for status="0". Regarding higher risk, I assumed that older people have a higher risk, so they should have an earlier date_e (hospitalization); also, status=1 has more risk than status=0, so df1 = date_e(1)-age(e) is smaller for status=1. – Ali Roghani Mar 15 '22 at 15:09

1 Answer


I guess what you want to simulate is the effect of status on longevity (i.e. the time difference between date1 and age in your reproducible example). At the moment, status has no effect on longevity:

library(ggplot2)

# Boxplot of (date1 - age) in years, split by status; dplyr supplies %>%
df %>%
  ggplot(aes(x    = factor(status), 
             y    = as.numeric(difftime(date1, age, units = 'weeks'))/52,
             fill = factor(status))) + 
  geom_boxplot(width = 0.6) +
  guides(fill = guide_none()) +
  labs(x = 'Status', y = 'Age (years)')

[Figure: boxplot of age (years) by status, before the adjustment]

Effectively, what you need to do is subtract a random amount of time from the date1 column where status == 1. To do this, you can take advantage of two facts: dates are stored 'under the hood' in R as a number of days since 1970-01-01, so subtracting a number from a Date moves it back by that many days, and you can multiply a random draw by the status column, since those with status == 0 will then always have 0 subtracted.
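To make the mechanism concrete, here is a minimal sketch on two made-up dates (the values and the fixed 3650-day shift are assumptions chosen only for illustration). It shows that a Date is a day count underneath and that multiplying the shift by a 0/1 status leaves the status == 0 row untouched:

```r
# Two example dates and a 0/1 status flag (made-up values for illustration)
d      <- as.Date(c("2008-11-02", "2008-03-28"))
status <- c(0, 1)

# A Date is a count of days since 1970-01-01 underneath
unclass(d)

# Subtracting status * 3650 moves only the status == 1 date, by 3650 days
shifted <- d - status * 3650
shifted
```

In the real data the fixed 3650 would be replaced by a random draw per row, which is what the one-liner in this answer does.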

So all you need to do is:

df$date1 <- df$date1 - df$status * round(rnorm(nrow(df), 3650, 500))

This removes on average 10 years (3650 days, with a standard deviation of 500 days) from those with status == 1 but leaves those with status == 0 as-is:

# Same boxplot as before, now on the adjusted date1
df %>% 
  ggplot(aes(x    = factor(status), 
             y    = as.numeric(difftime(date1, age, units = 'weeks'))/52,
             fill = factor(status))) + 
  geom_boxplot(width = 0.6) +
  guides(fill = guide_none()) +
  labs(x = 'Status', y = 'Age (years)')

[Figure: boxplot of age (years) by status, after the adjustment]
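If you prefer a numeric check to a plot, the following sketch rebuilds the data from the question, applies the same subtraction, and compares the group means. The `years` column and the `means` object are names introduced here for illustration, and dividing days by 365.25 is my own approximation of a year:

```r
# Rebuild the example data from the question
set.seed(1)
age    <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by = "day"), 1000)
date1  <- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by = "day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df     <- data.frame(age, date1, status)

# Apply the adjustment: only rows with status == 1 are shifted
df$date1 <- df$date1 - df$status * round(rnorm(nrow(df), 3650, 500))

# Mean of (date1 - age) in years for each status group
df$years <- as.numeric(difftime(df$date1, df$age, units = "days")) / 365.25
means <- tapply(df$years, df$status, mean)
means
```

The mean for status 1 should come out roughly 10 years below the mean for status 0, subject to sampling noise.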

Allan Cameron
  • Thank you, this is a great help, but your date1 comes down in all cases, while I think it would be better for some of them to have a later date and most of them an earlier date. – Ali Roghani Mar 15 '22 at 15:22
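One possible way to get the behaviour asked for in this comment, offered only as a sketch: draw the shift from a normal distribution whose standard deviation is large relative to its mean, so that a minority of draws are negative and push the date later. The mean and SD of 1825 days, and the names `shift`, `new_date1` and `moved`, are assumptions for illustration and not part of the answer above:

```r
set.seed(1)
n      <- 1000
status <- sample(c(0, 1), size = n, replace = TRUE, prob = c(0.8, 0.2))
date1  <- sample(seq(as.Date("2000-01-01"), as.Date("2010-01-01"), by = "day"), n)

# Assumed values: mean shift of about 5 years with an SD of about 5 years,
# so some draws are negative (date moves later) and most are positive (earlier)
shift     <- round(rnorm(n, mean = 1825, sd = 1825))
new_date1 <- date1 - status * shift

# Change in days for the status == 1 rows only
moved <- as.numeric(new_date1 - date1)[status == 1]
mean(moved < 0)  # share of status == 1 rows moved to an earlier date
mean(moved > 0)  # share of status == 1 rows moved to a later date
```

Tuning the ratio of mean to SD controls how many dates end up later: the smaller the mean relative to the SD, the larger that share.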