
For a project, I'm conducting a survival analysis in R on two datasets, analysing how much survival time depends on some blood biomarkers; one dataset was used for development and one for validation.

The problem is that when I try to plot the survival curves, ggsurvplot fails with the error "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 174, 0, 348". Also, if I pass the survfit object to the base plot function, I get a cumulative hazard plot instead (??). The problem occurs with both datasets, so the code below shows the commands for the first one only.

THE DATA ARE AVAILABLE FOR DOWNLOAD AT THIS LINK: https://www.cancerdata.org/resource/doi:10.17195/candat.2016.04.1

WHAT I TRIED TO DO TO SOLVE THE PROBLEM AND DIDN'T WORK:

  • Remove all NA values.
  • Not converting the chr variables to factors (but survfit then requires the status variable to be logical or numeric, and converting it to numeric turns every Status observation into NA).

HERE IS THE CODE I WROTE SO FAR:

library(survival)
library(survminer)
library(corrplot)
library(readxl)

# development dataset (sheet 1 of 2 of the xlsx document)
dati_D = read_excel("carvalho-prognostic-biomarkers-NSCLC.xlsx", sheet=2)

# renaming some variables because their names contain spaces or special characters
colnames(dati_D)[colnames(dati_D) == "Lymph nodes"] = "LymphNodes"
colnames(dati_D)[colnames(dati_D) == "RT Protocol"] = "RTProtocol"
colnames(dati_D)[colnames(dati_D) == "Total dose (1st)"] = "TotalDose1st"
colnames(dati_D)[colnames(dati_D) == "Total Dose (2nd)"] = "TotalDose2nd"
colnames(dati_D)[colnames(dati_D) == "IL 6"] = "IL6"
colnames(dati_D)[colnames(dati_D) == "IL 8"] = "IL8"
colnames(dati_D)[colnames(dati_D) == "Cyfra 21-1"] = "Cyfra211"
colnames(dati_D)[colnames(dati_D) == "WHO-PS"] = "WHOPS"
colnames(dati_D)[colnames(dati_D) == "CA-9"] = "CA9"
colnames(dati_D)[colnames(dati_D) == "FEV1s%"] = "FEV1s"

# setting chr's as factors
dati_D$Status = as.factor(dati_D$Status)
dati_D$histology = as.factor(dati_D$histology)
dati_D$stage = as.factor(dati_D$stage)
dati_D$Gender = as.factor(dati_D$Gender)
dati_D$RTProtocol = as.factor(dati_D$RTProtocol)
str(dati_D)

# Survival curves
c1 = survfit(Surv(Survival, Status) ~ Gender, data = dati_D)
plot(c1) # it gives a cumulative hazard plot, why?
ggsurvplot(c1, data = dati_D) # it gives the error mentioned above, why?

Please help me identify and solve this issue; because of it I can't continue the analysis. Thank you in advance.

As I wrote above, I tried removing all NAs and not converting the qualitative variables to factors, but neither worked. The dataset is 182x21, so I really don't know what the numbers in the error refer to or how to solve it.

IRTFM

2 Answers


SOLVED!!

The problem was that the variable Status had the values “alive” and “dead” instead of 0 and 1. Recoding them as 0 and 1 solved the issue.

I'm leaving this here in case anyone runs into a similar problem.
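
A minimal sketch of that recoding, run on a toy data frame rather than the real dataset. The rows below are invented for illustration, and `status01` is a hypothetical column name. It also shows why a direct `as.numeric()` conversion of the labels yields only NAs: the strings "alive" and "dead" cannot be parsed as numbers, so the labels need an explicit comparison instead.

```r
library(survival)

# Toy data, invented for illustration only (not taken from the dataset)
toy <- data.frame(
  Survival = c(5, 8, 12, 3, 9, 14),
  Status   = c("dead", "alive", "dead", "dead", "alive", "dead"),
  Gender   = c("male", "female", "male", "female", "male", "female")
)

# as.numeric() cannot parse the text labels, so every value becomes NA
all_na <- all(is.na(suppressWarnings(as.numeric(toy$Status))))

# Recode explicitly: 1 = dead (event), 0 = alive (censored)
toy$status01 <- ifelse(toy$Status == "dead", 1, 0)

# With a numeric 0/1 status, survfit() fits ordinary survival curves
fit <- survfit(Surv(Survival, status01) ~ Gender, data = toy)
print(fit)
```

The same `ifelse()` line should carry over to `dati_D$Status`, assuming the column holds exactly the two labels "alive" and "dead".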


One starts by downloading the file:

# mode = "wb" keeps the binary .xlsx file intact on Windows
download.file("https://www.cancerdata.org/system/files/publications/carvalho-prognostic-biomarkers-NSCLC.xlsx?file=1&type=node&id=64&force=", destfile = "dat1.xlsx", mode = "wb")
library(readxl)
library(survival)
library(survminer)
# Then read from the working directory, which happened to be "~/"
# Note the need to name the `sheet` argument and choose the second sheet
dati_D = read_excel("~/dat1.xlsx", sheet = 2)
# Then examine the result
str(dati_D)
# This shows that the Status variable is "alive" or "dead", so build a logical event indicator
dati_D$logical_status <- dati_D$Status == "dead"
# survival::Surv treats TRUE (or 1) as the event, here death

c1 = survfit(Surv(Survival, logical_status) ~ Gender, data = dati_D)
plot(c1) # now draws Kaplan-Meier survival curves
ggsurvplot(c1, data = dati_D)
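
A likely explanation for the behaviour in the question, offered as a sketch rather than a confirmed diagnosis: according to the `?Surv` documentation, a factor event variable is interpreted as multiple-endpoint (multi-state) data, with the first factor level treated as censoring. `survfit()` then returns a multi-state fit whose curves are probabilities of being in a state, which increase over time and can look like a cumulative hazard plot. The toy data below are invented for illustration.

```r
library(survival)

# Toy data, invented for illustration only
toy <- data.frame(
  Survival = c(5, 8, 12, 3, 9, 14),
  Status   = c("dead", "alive", "dead", "dead", "alive", "dead"),
  Gender   = c("male", "female", "male", "female", "male", "female")
)

# Factor status: Surv() builds a multi-state response
s_factor <- Surv(toy$Survival, factor(toy$Status))
print(attr(s_factor, "type"))

fit_factor <- survfit(Surv(Survival, factor(Status)) ~ Gender, data = toy)
is_multistate_factor <- inherits(fit_factor, "survfitms")

# Logical status: ordinary right-censored survival data
s_logical <- Surv(toy$Survival, toy$Status == "dead")
print(attr(s_logical, "type"))

fit_logical <- survfit(Surv(Survival, Status == "dead") ~ Gender, data = toy)
is_multistate_logical <- inherits(fit_logical, "survfitms")

print(c(factor = is_multistate_factor, logical = is_multistate_logical))
```

Checking `inherits(c1, "survfitms")` on the fit from the question would confirm whether this is what happened there. If it returns TRUE, the error from ggsurvplot is probably a consequence of passing it a multi-state fit.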


IRTFM