
I have a large data set collected over 13 years that includes many plots and a treatment. The response variable being tracked is mortality, and the data are currently in a summed format. To run Kaplan-Meier or other survival analyses, I need the data in an unsummed format: one row per individual, coded with its status (0 = alive, 1 = dead) and the date on which it died, along with the plot and treatment it came from. A snippet of the data frame I have looks like this:

[Image: current data frame format, with mortality summed across individuals.]

What I am trying to get to looks like this:

[Image: desired data frame, expanded so that individuals are coded by plot, treatment, and date of mortality, or given a zero if they are still alive (individuals to be censored).]

Or this simpler version would be okay too:

[Image: an alternative desired data frame, expanded so that individuals are coded by plot, treatment, and date of mortality.]

chaparral

1 Answer


This can be solved with nested loops, although it may take a while to run if your dataset is really large. Assign the original dataset to a data frame called df.

# First format: slower, because it needs a second inner loop (alive and dead individuals)

new_df = data.frame()
for (row in 1:nrow(df)){
    n = as.integer(df$alive_count[row])
    if (n > 0){
       for (person in 1:n){
           new_entry = df[row,]
           new_entry$status = 0
           new_df = rbind(new_df,new_entry)
       }
    }
    m = as.integer(df$dead_count[row])
    if (m > 0){
       for (person in 1:m){
           new_entry = df[row,]
           new_entry$status = 1
           new_df = rbind(new_df,new_entry)
       }
    }
}
#Remove redundant columns
new_df$alive_count = NULL
new_df$dead_count = NULL
new_df$total_count = NULL

# Second format: faster (dead individuals only) and recommended

new_df = data.frame()
for (row in 1:nrow(df)){
    n = as.integer(df$dead_count[row])
    if (n == 0){next}
    for (person in 1:n){
        new_entry = df[row,]
        new_entry$status = 1
        new_df = rbind(new_df,new_entry)
    }
}
#Remove redundant columns
new_df$alive_count = NULL
new_df$dead_count = NULL
new_df$total_count = NULL

Hope this helps! If anyone finds a more efficient approach, do let me know.
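One possibly faster alternative is to avoid growing the data frame with rbind inside a loop and instead repeat row indices with rep(). The following is only a sketch under stated assumptions: the columns plot, treatment, and date, the example values, and the helper name expand_counts are all hypothetical, since the original data frame is not shown; only alive_count, dead_count, and total_count come from the code above. It also assumes the counts contain no NA values.

```r
# Hypothetical summed data; plot, treatment, date and all values are made up.
df <- data.frame(
  plot        = c("A", "A", "B"),
  treatment   = c("control", "burn", "control"),
  date        = as.Date(c("2010-05-01", "2011-05-01", "2010-05-01")),
  alive_count = c(1L, 0L, 2L),
  dead_count  = c(0L, 2L, 1L),
  stringsAsFactors = FALSE
)
df$total_count <- df$alive_count + df$dead_count

# expand_counts is a hypothetical helper name, not part of any package.
expand_counts <- function(df) {
  alive <- as.integer(df$alive_count)
  dead  <- as.integer(df$dead_count)
  rows  <- seq_len(nrow(df))
  # Repeat each row index once per alive individual, then once per dead one.
  # A count of zero simply contributes no rows.
  idx    <- c(rep(rows, times = alive), rep(rows, times = dead))
  status <- c(rep(0L, sum(alive)), rep(1L, sum(dead)))
  # Drop the count columns, which are redundant after expansion.
  keep <- setdiff(names(df), c("alive_count", "dead_count", "total_count"))
  out  <- df[idx, keep, drop = FALSE]
  out$status <- status
  # Restore the source-row order, alive individuals before dead ones.
  out <- out[order(idx, status), , drop = FALSE]
  rownames(out) <- NULL
  out
}

new_df <- expand_counts(df)
```

With this sketch, the second format (dead individuals only) would be new_df[new_df$status == 1, ]. Because no loop runs over 1:n, zero counts need no special handling.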

  • Thanks for this response, Matthew. It comes very close to working (for the first format) but it is consistently adding two dead rows when they should not be there. For example, when there is only one living individual (status = 0) and no dead individuals, the output for column status = 0,1,1, where status 1 = dead. It is doing the same with the second format: when there are zero dead it reports status = 1,1. – chaparral Oct 09 '22 at 17:51
  • Thanks for pointing this out. This occurs because a for loop over 1:0 actually runs 2 iterations (1:0 counts down, giving 1 and 0), which accounts for the 2 extra incorrect entries you observed. I have added an if statement to correct this error. Please try again and let me know if there are any errors. – Matthew Durai Oct 10 '22 at 12:40
  • That makes sense. The new code (first format) is not recoding any 1s, but is doing the zeros just as it should. Small detail: there is an extra parenthesis in line 5. – chaparral Oct 10 '22 at 13:24
  • Alright, I've edited the if statements for the first format to record both the 1s and 0s. The second format should already be working fine. Do update me if you run into further errors. – Matthew Durai Oct 10 '22 at 13:36
  • This solved the issue for me. Thank you for your help. – chaparral Jun 25 '23 at 13:26
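The 1:0 pitfall discussed in the comments can be checked directly in base R. This is a small illustration, not part of the original answer; seq_len() is offered here as one way to loop that would not need the if guard.

```r
n <- 0L

length(1:n)         # 2, because 1:0 counts down and yields c(1, 0)
length(seq_len(n))  # 0, so a loop over seq_len(n) is skipped entirely

iterations <- 0L
for (i in seq_len(n)) {
  iterations <- iterations + 1L
}
iterations          # still 0: the loop body never ran
```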