2

I have been using the survival package in R to generate Kaplan-Meier plots using tables of mortality data. I am now trying to generate the same type of plots with new data that is in a different format. I would like to convert the data so it is in the same format as the input data I have been using to succesfully make Kaplan-Meier plots.

The data I typically has the following format:

ID Entry_type Departure_type Birth_date Death_date  Lifespan
234 B 1 2008-05-01 2020-01-20 10.986995
549 B 1 2014-05-25 2016-02-09 1.711157
235 B 1 2015-02-01 2017-08-01 2.496920ID

This data has one individual per row, specified by their ID. The final column is that individual's lifespan

I can generate a Kaplan-Meier plot from this table using the code:

survival_plot <- ggsurvplot(
    fit = survfit(Surv(Lifespan, Depart.Type) ~ 1, data = life_span_table), 
    xlab = "Years", 
    ylab = "Overall survival probability",
    surv.median.line = "hv",
    title = "Survival Curve",
    legend.title="")

However I now have census record data where I am tracking a cohort born in the year 1905. The data looks like this:

Year Age Total BirthYear CumDeath
1905  0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111

Explainer for the columns:

Year: The year being recorded.
Age: The age of the individual from the 1905 cohort in that year.
Total: The number of individuals born in 1905 who died in the recorded year.
BirthYear: Always 1905 for this cohort (redundant information)
CumDeath: The total number of people from the cohort who have died at the point of recording (cumulative death). 

So this data format does not have an individual per row as previously. I feel the only way to make identical Kaplan-Meier plots using this data is to reformat it so that there is one individual per row and their Lifespan data is added as a new column to fit the format I usually work with.

Lifespan can be calculated as the difference between the Year and the Birthyear, but I would need to make the number of rows for each lifespan equal to the Total entry for each year. Then I can add a Entry_Type and Departure_Type column for each row that would always be 'B' for Entry_type and '1' for Departure_type.

I usually use dplyr for reformatting tables but I am not sure of the best way to add new rows for each lifespan conditional on the Total entry. Can dplyr do this or am I better off using a loop or lapply function?

Any help is greatly appreciated.

user964689
  • 812
  • 7
  • 20
  • 40
  • You should read up on the construction of "life tables". There are several R packages that will do this for you. – IRTFM Sep 13 '21 at 22:24

1 Answers1

0

It turns out to be a rather easy exercise using R data.frame row indexing. Consider this toy example (which is pretty similar to yours):

tt = data.frame(name=c('a', 'b', 'c'), 
                count=c(1, 2, 3))
# extract counts specifying how many times to replicate each row
counts = tt$count
# construct row indexes
row_positions = seq_along(counts)
# or
row_positions = seq(1, nrow(tt))
# inflate vector of row indexes according to the counts
row_pos_replicated = rep(row_positions, counts)
# use inflated indexes to expand frame rows per `count` values
tt_replicated = tt[row_pos_replicated,]

We just expanded rows from tt into new frame tt_replicated according to the count column:

> tt_replicated
       name  count
1      a     1
2      b     2
2.1    b     2
3      c     3
3.1    c     3
3.2    c     3

Solution for your frame based on above and using more condensed syntax:

census_data = read.csv(text = "Year Age Total BirthYear CumDeath
1905 0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111", sep="")

census_data_for_KMplot = census_data[rep(seq(1,
                                             nrow(census_data)),
                                         census_data$CumDeath),]

Let's count number of rows for each Year (using data.table):

> data.table::setDT(census_data_for_KMplot)[, .N, by=Year]
   Year     N
1: 1905  9262
2: 1906 10597
3: 1907 11111

Using data.table for speed and somewhat simpler syntax (notice no sep= in fread and no , inside [] while row indexing):

library(data.table)

census_data = fread(text = "Year Age Total BirthYear CumDeath
1905 0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111")

census_data_for_KMplot = census_data[rep(seq_along(census_data$CumDeath), 
                                     census_data$CumDeath)]
census_data_for_KMplot[, .N, by=Year]

   Year     N
1: 1905  9262
2: 1906 10597
3: 1907 11111
topchef
  • 19,091
  • 9
  • 63
  • 102
  • Thanks I think this is very close but I miscommunicated, I should have said it is dependent on Total not CumDeath. For the year 1907 there should only be 514 rows/entries because 514 individuals born in 1905 died in that year. With your code the N increases every year but what I need is the number of rows for each year to correspond to the value in the 'Total' column. I will try to modify the code as it was my error. – user964689 Oct 01 '21 at 13:38