0

If you run the following code, you will have a data frame real.dat which has 1063 samples for 20531 genes. There are 2 extra columns named time and event where time is the survival time and event is death in case of 1 and 0 in case of censored.

lung.dat <- read.table("genomicMatrix_lung")
lung.clin.dat <- read.delim("clinical_data_lung")

# For clinical data, get only rows which do not have NA in column "X_EVENT"
lung.no.na.dat <- lung.clin.dat[!is.na(lung.clin.dat$X_EVENT), ]

# Getting the transpose of main lung cancer data
ge <- t(lung.dat)
# Getting a vector of all the id's in the clinical data frame without any 'NA' values
keep <- lung.no.na.dat$sampleID

# getting only the samples(persons) for which we have a value rather than 'NA' values
real.dat <- ge[ge[, 1] %in% keep, ]

# adding the 2 columns from clinical data to gene expression data
keep_again <- real.dat[, 1]
temp_df <- lung.no.na.dat[lung.no.na.dat$sampleID %in% keep_again, ]

# naming the columns into our gene expression data
col_names <- ge[1, ]
colnames(real.dat) <- col_names

dd <- temp_df[, c('X_TIME_TO_EVENT', 'X_EVENT')]
real.dat <- cbind(real.dat, dd)

# renaming the 2 new added columns
colnames(real.dat)[colnames(real.dat) == 'X_TIME_TO_EVENT'] <- 'time'
colnames(real.dat)[colnames(real.dat) == 'X_EVENT'] <- 'event'

I want to get the univariate Cox regression p-value for each gene in the above data frame. How can I get this?

You can download the data from here.

Edit: Sorry for not clarifying enough. I have already tried to get it with the coxph function from the survival library. But even for one gene, it shows the following error -

> coxph(Surv(time, event) ~ HIF3A, real.dat) Error in fitter(X, Y, strats, offset, init, control, weights = weights, : NA/NaN/Inf in foreign function call (arg 6) In addition: Warning message: In fitter(X, Y, strats, offset, init, control, weights = weights, : Ran out of iterations and did not converge

That is why I did not provide a smaller reproducible example.

nafizh
  • 185
  • 3
  • 14
  • 1
    please make your code reproducible--there is even a built-in lung data set you can use. we shouldn't have to do extra work or download data on 20 thousand genes from third party websites to help you – rawr Apr 18 '16 at 18:41

1 Answers1

2

You really going to do univariate regression for each gene of 20531 genes??

Guessing wildly at the structure of your data (so creating a dummy set, based on the examples in help), and guessing what you're trying to do with the following toy example.....

library("survival")
?coxph ## to see the examples
## create dummy data
test <- list(time=c(4,3,1,1,2,2,3), 
          event=c(1,1,1,0,1,1,0), 
          gene1=c(0,2,1,1,1,0,0), 
          gene2=c(0,0,0,0,1,1,1))
## Cox PH regression
coxph(Surv(time, event) ~ gene1, test) 
coxph(Surv(time, event) ~ gene2, test)

You may wish to use the following to get CIs and more information.

summary(coxph(...)) 

Hopefully that code is reproducible enough to help you clarify the question

Big Old Dave
  • 343
  • 2
  • 11
  • Hi, thanks for your answer. But as I have updated my question, even for one gene, the `coxph` function shows error. – nafizh Apr 18 '16 at 22:26
  • So the question was really about interpreting the error message? – Big Old Dave Apr 19 '16 at 16:33
  • Hi, yeah, it was, sorry for my confusing question. But I found out the problem. As it turned out my columns were factor type, not numeric. That is why these errors were coming. But your example helped me to confirm that I was doing the right formula and the shape of my data frame was correct. In that spirit, I am accepting your answer. Thanks. – nafizh Apr 20 '16 at 00:11