0

I am trying to get the logrank p value from a survival Kaplan-Meier plot comparing cancer patients expressing low levels or the gene BRCA1 and patients expressing high levels of BRCA1 (BRCA1 -BReast CAncer gene 1)

So, I downloaded the cancer data -a DNA array- from GEO (GSE1456) Link to Cancer data from patients, and description

In this array BRCA1 quantification goes from 2 to 7 (arbritary units), so I want to split the data (low/high expression) at 5.74

split =5.74
probe = '204531_s_at' #code name for the probe that measures BRCA1

T1=df.loc[df[probe] < split ,'SURV_RELAPSE'] # Expression value (2-7)
T2=df.loc[df[probe] > split,'SURV_RELAPSE']

E1=df.loc[df[probe] < split ,'RELAPSE'] # 0- no cancer relapse, 1- cancer relapse
E2=df.loc[df[probe] > split ,'RELAPSE']

and I plot the Kaplan-Meier using lifelines

kmf = KaplanMeierFitter()
kmf.fit(T1, event_observed=E1, label='low') 
kmf.plot_survival_function(show_censors=True, ax=ax) 

kmf.fit(T2, event_observed=E2, label='high') 
kmf.plot_survival_function( show_censors=True, ax=ax)

enter image description here

and finally I got the logrank value as:

results = logrank_test(T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()
print('p_value: ', results.p_value) 

with values p_value: 0.3932833438047245

To validate my analysis I compared to KM-plotter. The good thing is that it shows same image: enter image description here

Howerver, with a different logrank value, with a logrank p= 0.046

From my analysis the two survival plots would not be statistical significant, but by using KM-plotter they are.

I am suspecting that the censored samples might play a role in how the p is calculated but not sure how to fix it.

Question: Can anyone guide me on how to use lifelines to properly calculate the logrank?

thanks

EDITED to add raw data:

T1:
SURV_RELAPSE
100.32
98.76
97.44
96.84
76.56
97.56
48.96
97.32
66.36
16.68
75.84
91.32000000000001
96.35999999999999
88.56
96.35999999999999
79.80000000000001
92.76
93.72
95.4
95.52
82.32000000000001
84.36
101.52000000000001
73.56
86.52
97.56
84.84
47.88
90.6
93.0
96.96000000000001
18.6
100.80000000000001
90.48
75.72
99.60000000000001
84.84
97.80000000000001
98.88
72.24
70.19999999999999
69.72
70.32000000000001
68.52
67.56
95.88
19.32
99.96000000000001
69.84
70.32000000000001
51.599999999999994
95.03999999999999
78.24
77.52
101.16
74.88
46.68
91.08
2.7600000000000002
92.03999999999999
94.67999999999999
89.28
33.36
75.12
71.4
71.76
72.6
81.12
95.28
76.80000000000001
88.80000000000001
77.28
15.600000000000001
95.88
94.08
92.03999999999999
77.28
100.32
99.84
T2:
SURV_RELAPSE
45.839999999999996
97.80000000000001
96.84
99.60000000000001
41.64
6.720000000000001
92.52
53.28
94.32000000000001
35.04
13.080000000000002
52.32000000000001
90.6
14.28
61.92
91.80000000000001
87.48
96.35999999999999
98.39999999999999
91.08
101.76
100.08
87.36
94.08
91.80000000000001
87.36
92.76
22.200000000000003
37.8
60.0
69.0
68.16
66.72
68.28
99.24
12.48
72.36
89.03999999999999
51.239999999999995
90.12
99.60000000000001
93.36
71.4
99.84
46.32
94.67999999999999
97.56
99.84
78.6
78.6
84.6
13.440000000000001
68.28
92.03999999999999
47.04
93.36
90.0
88.32000000000001
92.76
13.440000000000001
17.52
101.64000000000001
96.35999999999999
9.120000000000001
91.56
93.0
15.120000000000001
79.32000000000001
86.28
10.8
75.84
101.88
101.88
71.28
16.080000000000002
8.040000000000001
33.480000000000004
16.56
67.44
66.96000000000001
E1:
RELAPSE
 0
 0
 0
 0
 0
 0
 1
 0
 1
 1
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 1
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 1
 0
 0
 0
 1
 0
 0
 0
 0
 1
 1
 0
 1
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 1
 1
 0
 0
 0
 0
 0
 0
E2:
RELAPSE
 1
 0
 0
 0
 1
 1
 0
 1
 0
 1
 1
 1
 0
 1
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 1
 1
 1
 0
 0
 0
 0
 0
 1
 0
 0
 1
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 1
 0
 0
 1
 0
 0
 0
 0
 1
 1
 0
 0
 1
 0
 0
 1
 0
 0
 1
 0
 0
 0
 0
 1
 1
 1
 1
 0
 0
daniel_hck
  • 1,100
  • 3
  • 19
  • 38
  • 1
    what column is `204531_s_at` - that column name is not present in the table at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GSE1456&id=8239&db=GeoDb_blob35 – Cam.Davidson.Pilon Jun 08 '20 at 20:28
  • 1
    Anyways, your lifelines code look right. Log-rank statistic can have different weightings, I wonder what KM-plotter is using. Anyways, if I can reproduce the dataset, I think I can help – Cam.Davidson.Pilon Jun 08 '20 at 20:34
  • Thank you for your answer. No, the link you added it is a summary table where each row is a patient. For each row (patients) there is another table of 22K (or so) genes. 204531_s_at is just one of them. Because this is complicated, I will try to copy-paste the filtered data and show only T1, T2, E1 and E2. – daniel_hck Jun 08 '20 at 21:12

0 Answers0