For a project, I'm conducting a survival analysis in R on two datasets to assess how much survival time depends on some blood biomarkers; one dataset was used for development and one for validation.
The problem is that when I try to plot the survival curves, ggsurvplot fails with the error "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 174, 0, 348". Also, if I pass the survfit object to the base plot function, I get what looks like a cumulative hazard plot instead of survival curves. The problem occurs identically for both datasets, so in the code below I report the commands for the first one only.
THE DATA ARE AVAILABLE FOR DOWNLOAD AT THIS LINK: https://www.cancerdata.org/resource/doi:10.17195/candat.2016.04.1
WHAT I TRIED TO DO TO SOLVE THE PROBLEM AND DIDN'T WORK:
- Remove all NA values.
- Leaving the chr variables as character instead of converting them to factors. Even so, survfit requires the status variable to be logical or numeric, and converting Status with as.numeric() turns every observation into NA.
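To isolate the Status issue from the rest of the pipeline, here is a minimal sketch on made-up data. The toy data frame, the column name event, and the labels "Dead"/"Alive" are assumptions for illustration only; the actual coding of Status in the xlsx file may differ. It shows why as.numeric() on character labels returns NA, what kind of fit a factor status produces, and one way to recode the status explicitly as 0/1.

```r
library(survival)

# Toy data (hypothetical): the labels "Dead"/"Alive" are an assumption,
# not necessarily the coding used in the real dataset.
toy <- data.frame(
  Survival = c(5, 8, 12, 3, 9, 14),
  Status   = c("Dead", "Alive", "Dead", "Dead", "Alive", "Dead"),
  Gender   = c("M", "F", "M", "F", "M", "F")
)

# as.numeric() cannot parse text labels, so every value becomes NA
na_status <- suppressWarnings(as.numeric(toy$Status))

# With a factor status, Surv() builds a multi-state ("mstate") object,
# and survfit() then returns a multi-state fit rather than a plain
# survival curve
fit_factor <- survfit(Surv(Survival, factor(Status)) ~ Gender, data = toy)
class(fit_factor)

# Explicit 0/1 recoding: 1 = event observed, 0 = censored
toy$event <- as.integer(toy$Status == "Dead")
fit_01 <- survfit(Surv(Survival, event) ~ Gender, data = toy)
class(fit_01)
```

If the real Status column behaves like this toy example, comparing class(c1) against the two fits above should show which case applies; I have not confirmed this on the actual data.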
HERE IS THE CODE I WROTE SO FAR:
library(survival)
library(survminer)
library(corrplot)
library(readxl)
# development dataset (sheet 1 of 2 of the xlsx document)
dati_D = read_excel("carvalho-prognostic-biomarkers-NSCLC.xlsx", sheet=2)
# renaming some variables because their names contain spaces or special characters
colnames(dati_D)[colnames(dati_D) == "Lymph nodes"] = "LymphNodes"
colnames(dati_D)[colnames(dati_D) == "RT Protocol"] = "RTProtocol"
colnames(dati_D)[colnames(dati_D) == "Total dose (1st)"] = "TotalDose1st"
colnames(dati_D)[colnames(dati_D) == "Total Dose (2nd)"] = "TotalDose2nd"
colnames(dati_D)[colnames(dati_D) == "IL 6"] = "IL6"
colnames(dati_D)[colnames(dati_D) == "IL 8"] = "IL8"
colnames(dati_D)[colnames(dati_D) == "Cyfra 21-1"] = "Cyfra211"
colnames(dati_D)[colnames(dati_D) == "WHO-PS"] = "WHOPS"
colnames(dati_D)[colnames(dati_D) == "CA-9"] = "CA9"
colnames(dati_D)[colnames(dati_D) == "FEV1s%"] = "FEV1s"
# setting chr's as factors
dati_D$Status = as.factor(dati_D$Status)
dati_D$histology = as.factor(dati_D$histology)
dati_D$stage = as.factor(dati_D$stage)
dati_D$Gender = as.factor(dati_D$Gender)
dati_D$RTProtocol = as.factor(dati_D$RTProtocol)
str(dati_D)
# Survival curves
c1 = survfit(Surv(Survival, Status) ~ Gender, data = dati_D)
plot(c1) # it gives a cumulative hazard plot, why?
ggsurvplot(c1, data = dati_D) # it gives the error mentioned above, why?
Please help me identify and solve this issue; I can't continue the analysis until it is fixed. Thank you in advance.
As I wrote above, I tried removing all NAs and not converting the qualitative variables to factors, but neither worked. The dataset is 182x21, so I don't know what the numbers in the error (174, 0, 348) refer to or how to fix it.