I have been using the survival package in R to generate Kaplan-Meier plots using tables of mortality data. I am now trying to generate the same type of plots with new data that is in a different format. I would like to convert the data so it is in the same format as the input data I have been using to succesfully make Kaplan-Meier plots.
The data I typically has the following format:
ID Entry_type Departure_type Birth_date Death_date Lifespan
234 B 1 2008-05-01 2020-01-20 10.986995
549 B 1 2014-05-25 2016-02-09 1.711157
235 B 1 2015-02-01 2017-08-01 2.496920ID
This data has one individual per row, specified by their ID. The final column is that individual's lifespan
I can generate a Kaplan-Meier plot from this table using the code:
survival_plot <- ggsurvplot(
fit = survfit(Surv(Lifespan, Depart.Type) ~ 1, data = life_span_table),
xlab = "Years",
ylab = "Overall survival probability",
surv.median.line = "hv",
title = "Survival Curve",
legend.title="")
However I now have census record data where I am tracking a cohort born in the year 1905. The data looks like this:
Year Age Total BirthYear CumDeath
1905 0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111
Explainer for the columns:
Year: The year being recorded.
Age: The age of the individual from the 1905 cohort in that year.
Total: The number of individuals born in 1905 who died in the recorded year.
BirthYear: Always 1905 for this cohort (redundant information)
CumDeath: The total number of people from the cohort who have died at the point of recording (cumulative death).
So this data format does not have an individual per row as previously. I feel the only way to make identical Kaplan-Meier plots using this data is to reformat it so that there is one individual per row and their Lifespan data is added as a new column to fit the format I usually work with.
Lifespan can be calculated as the difference between the Year and the Birthyear, but I would need to make the number of rows for each lifespan equal to the Total entry for each year. Then I can add a Entry_Type and Departure_Type column for each row that would always be 'B' for Entry_type and '1' for Departure_type.
I usually use dplyr for reformatting tables but I am not sure of the best way to add new rows for each lifespan conditional on the Total entry. Can dplyr do this or am I better off using a loop or lapply function?
Any help is greatly appreciated.