
I've been working on survival analysis of gene mutations. The data, downloaded and merged from TCGA somatic mutation files (MAF), look like this:

         barcode stage_group gender fustat futime  SRCAP  ZFHX4  AMER1 PCDHB8 AHNAK2
1   TCGA-CA-6719   StageI-II   MALE      0     41     WT     WT     WT     WT     WT
2   TCGA-A6-2685   StageI-II FEMALE      0    464     WT     WT     WT     WT     WT
3   TCGA-CK-6751   StageI-II FEMALE      0    518     WT     WT     WT Mutate     WT
4   TCGA-DY-A1H8 StageIII-IV FEMALE      1    992     WT     WT     WT     WT     WT
5   TCGA-AG-3887   StageI-II   MALE      0     28     WT     WT     WT     WT     WT
6   TCGA-DM-A28M   StageI-II   MALE      0   2775     WT     WT     WT     WT Mutate
7   TCGA-CM-6675 StageIII-IV   MALE      0    153     WT     WT     WT     WT     WT
8   TCGA-D5-6533     Missing FEMALE      0     40     WT     WT     WT Mutate     WT
9   TCGA-SS-A7HO   StageI-II FEMALE      0   1829     WT     WT     WT     WT     WT
10  TCGA-AY-A8YK StageIII-IV   MALE      0    209     WT     WT     WT     WT     WT
11  TCGA-AA-A02Y   StageI-II   MALE      0     31     WT     WT     WT     WT     WT
12  TCGA-AD-5900   StageI-II   MALE      0      2     WT     WT     WT     WT Mutate

SRCAP, ZFHX4, AMER1, PCDHB8, AHNAK2, ... are genes selected by univariate Kaplan-Meier survival analysis and log-rank tests: for each gene, patients are divided into WT and Mutate groups based on the gene's mutation status, the resulting p-values are ordered, and p = 0.05 is used as the threshold. Now I need to include all the clinical features in the analysis along with these genes:
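For reference, the per-gene screening step described above can be sketched roughly as follows. This is a minimal, standard-library Python illustration of a two-group log-rank test, not the code actually used for the analysis; the function name `logrank_test`, its arguments, and the toy numbers are all made up for the example.

```python
import math

def logrank_test(times, events, mutated):
    """Two-group log-rank test (1 degree of freedom).

    times   : follow-up times (like futime)
    events  : 1 = event observed, 0 = censored (like fustat)
    mutated : True for the Mutate group, False for WT
    Returns (chi-square statistic, p-value).
    """
    observed = expected = variance = 0.0
    # Loop over the distinct times at which at least one event occurred.
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        n = sum(1 for ti in times if ti >= t)                         # at risk, all
        n1 = sum(1 for ti, g in zip(times, mutated) if ti >= t and g)  # at risk, Mutate
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, mutated)
                 if ti == t and e == 1 and g)
        observed += d1
        expected += d * n1 / n
        if n > 1:  # hypergeometric variance of d1 at this event time
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return float("nan"), float("nan")
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Toy data (invented): four patients, all with events, two of them mutated.
chi2, p = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [True, True, False, False])
print(round(chi2, 3), round(p, 3))
```

Running one such test per gene and keeping everything with p < 0.05 involves many comparisons, so some of the selected genes may be false positives unless the p-values are adjusted for multiple testing.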

Surv(futime, fustat)~ gender+age+project+subtype+race_group+stage_group+SRCAP+ZFHX4+AMER1+PCDHB8+AHNAK2+DNAH5+NALCN+PAPPA+PCDH17+RELN+UGGT2+HYDIN

and the result:

                      coef  exp(coef)   se(coef)  robust se      z Pr(>|z|)    
genderMALE       9.020e-01  2.465e+00  3.819e-01  3.696e-01  2.441 0.014659 *  
subtypeMissing   4.793e-01  1.615e+00  8.825e-01  1.045e+00  0.459 0.646364    
subtypeMucinous  1.354e+00  3.874e+00  5.972e-01  6.053e-01  2.238 0.025250 *  
race_groupWhite -6.223e-01  5.367e-01  3.921e-01  3.903e-01 -1.594 0.110878    
SRCAPWT         -1.233e+00  2.914e-01  5.177e-01  6.516e-01 -1.892 0.058474 .  
ZFHX4WT         -1.577e+00  2.065e-01  4.996e-01  5.621e-01 -2.806 0.005014 ** 
AMER1WT         -2.932e+00  5.332e-02  6.121e-01  5.547e-01 -5.285 1.26e-07 ***
AHNAK2WT         2.190e+00  8.932e+00  1.063e+00  9.183e-01  2.385 0.017097 *  
DNAH5WT          2.011e+00  7.474e+00  7.732e-01  6.077e-01  3.310 0.000932 ***
NALCNWT         -8.528e-01  4.262e-01  4.790e-01  4.151e-01 -2.055 0.039905 *  
RELNWT           2.063e+01  9.155e+08  5.425e+03  1.659e+00 12.435  < 2e-16 ***
UGGT2WT         -2.783e+00  6.185e-02  7.052e-01  5.688e-01 -4.893 9.95e-07 ***
HYDINWT          1.864e+00  6.450e+00  7.435e-01  7.284e-01  2.559 0.010499 *

I'm not convinced by the whole procedure or the result. Why does the stage factor not appear to be important for survival? Besides, the hazard ratio for some genes is implausibly high (RELNWT: 9.155e+08). I'm not sure whether the reason is the sparse, binary nature of the mutation data.
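One thing that may be worth checking for RELN: a coefficient of about 20 with a model-based standard error in the thousands is the usual signature of an estimate that has diverged because one of the two groups has no events at all (monotone likelihood, or separation). That is only a likely explanation, not something confirmed here. A quick way to look for it is to count events per mutation status for each gene. The sketch below does this in plain Python on the 12 rows shown above only, not on the full cohort; the helper names `events_by_status` and `flag_separation` are invented for the example.

```python
def events_by_status(fustat, status):
    """Return {status: (number of patients, number of events)}."""
    counts = {}
    for event, s in zip(fustat, status):
        n, d = counts.get(s, (0, 0))
        counts[s] = (n + 1, d + event)
    return counts

def flag_separation(fustat, gene_columns):
    """List genes where some group has zero events or only one group exists."""
    flagged = []
    for gene, status in gene_columns.items():
        counts = events_by_status(fustat, status)
        if len(counts) < 2 or any(d == 0 for _, d in counts.values()):
            flagged.append(gene)
    return flagged

# The 12 rows printed in the question (fustat and two of the gene columns).
fustat = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
genes = {
    "PCDHB8": ["WT", "WT", "Mutate", "WT", "WT", "WT",
               "WT", "Mutate", "WT", "WT", "WT", "WT"],
    "AHNAK2": ["WT", "WT", "WT", "WT", "WT", "Mutate",
               "WT", "WT", "WT", "WT", "WT", "Mutate"],
}

for gene, status in genes.items():
    print(gene, events_by_status(fustat, status))
print("possible separation:", flag_separation(fustat, genes))
```

In these 12 rows the only event is in a WT patient, so both genes are flagged. On the full data set, a gene whose Mutate group contains very few patients or no events cannot give a stable hazard ratio in an ordinary Cox fit, whatever the robust standard error suggests.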

What is the proper way to perform survival analysis based on mutation data? I'd really appreciate an explanation. Thanks.

Roy
  • This seems like more of an analysis or data interpretation question than a coding question. You may have better luck getting an answer on a different stack exchange site (biology maybe?) – Jan Boyer Dec 12 '19 at 15:36
  • Thanks, I'll check the other Stack Exchange sites. I actually asked this question on Biostars, but no one replied, so I posted it here. – Roy Dec 13 '19 at 03:59
