How to reproduce Stata competing-risks survival analysis results in R?

Question

I am attempting to reproduce some survival analysis results published in a journal. The original results were produced in Stata. Here is the code:

stset time, id(leadid) failure(c_coup)

streg  legislature  leg_growth_2 gdp_1k chgdpen_fearonlaitin Oil_fearonlaitin postcoldwarlag civiliandictatorshiplag militarydictatorshiplag communist lpopl1_fearonlaitin ethfrac_fearonlaitin relfrac_fearonlaitin  age, distribution(weibull) time

Here is my attempt to translate this to R code:

# load packages
library(survival)
library(survminer)

# load data
data <- read.csv("leader_tvc_2.csv", encoding = "latin1")

# create survival object
surv_object <- Surv(time, c_coup) ~ legislature + leg_growth_2 + gdp_1k + chgdpen_fearonlaitin + Oil_fearonlaitin + postcoldwarlag + civiliandictatorshiplag + militarydictatorshiplag + communist + lpopl1_fearonlaitin + ethfrac_fearonlaitin + relfrac_fearonlaitin + age

# run survreg
fit <- survreg(surv_object, data = data, dist = "weibull")

# examine output 
summary(fit)

I am unable to obtain the same results. For example, the estimate for legislature in the Stata results (the ones I want to produce) is 2.057. In R, however, it is 2.48223. Any advice would be appreciated. Here is a link to the full data.

Here is the R output:

Call:
survreg(formula = surv_object, data = data, dist = "weibull")
                           Value Std. Error     z       p
(Intercept)              3.69422    0.62813  5.88 4.1e-09
legislature              2.48223    0.21580 11.50 < 2e-16
leg_growth_2             4.06103    1.81609  2.24  0.0253
gdp_1k                  -0.02217    0.03678 -0.60  0.5467
chgdpen_fearonlaitin     0.49579    1.11792  0.44  0.6574
Oil_fearonlaitin         0.24194    0.23799  1.02  0.3094
postcoldwarlag           0.55797    0.31626  1.76  0.0777
civiliandictatorshiplag -1.87331    0.32142 -5.83 5.6e-09
militarydictatorshiplag -1.34963    0.29635 -4.55 5.3e-06
communist                0.88426    0.28211  3.13  0.0017
lpopl1_fearonlaitin     -0.00252    0.06636 -0.04  0.9697
ethfrac_fearonlaitin     0.47922    0.28625  1.67  0.0941
relfrac_fearonlaitin     0.67571    0.37836  1.79  0.0741
age                      0.01261    0.00704  1.79  0.0733
Log(scale)              -0.12430    0.06339 -1.96  0.0499

Scale= 0.883 

Weibull distribution
Loglik(model)= -829   Loglik(intercept only)= -978
    Chisq= 297.98 on 13 degrees of freedom, p= 6.4e-56 
Number of Newton-Raphson Iterations: 18 
n=3774 (5911 observations deleted due to missingness)

Here are the original results (column 3)

I was unable to read the .dta file into R using `haven::read_dta()`. The parameterisation used by R is different. A related post is . — stephematician, May 19 '23 at 02:09
Thanks for trying. Would you mind tryingf with this link? Its the csv version.https://drive.google.com/file/d/1JtyIszUHDZP3Z8O3BFNtBzREFaK53KJh/view?usp=sharing — w5698, May 19 '23 at 03:37
Thanks, the .csv works. I recommend you check the stats stackexchange post, and check both the R survreg and weibull documentation, and the stata streg docs. The differences in parameterisations are tricky but you can convert between the two. Also, if you could post the output that you get from both commands, i.e. the coefficients and the values of the scale - that would help. — stephematician, May 19 '23 at 07:58
Thanks -- I have updated the post to provide that information! — w5698, May 19 '23 at 12:33
I never do this kind of thing in either program. While I hope the thread is inching towards a good answer, I have to wonder whether trying a much simpler model with a simpler dataset in both programs would be a better way to establish the correspondence of syntax. — Nick Cox, May 19 '23 at 12:37
I agree with @NickCox. Also, the results suggest something other than parameterisation differences. Differences in parameterisation usually lead to differences in the reported estimate for the intercept (effectively the scale) and the shape parameter, but the effects would typically be the same. In the tables presented, some of the effects are even in the opposite direction (e.g. GDP, population). Only way to solve this would be to re-run the analysis in STATA. — stephematician, May 20 '23 at 01:27
I did run the analysis in Stata, and I managed to produce the exact same results as in the manuscript above. — w5698, May 20 '23 at 02:44
I would try with a simpler dataset then e.g. take the `rats` data from the survival package in R, and see if you can reproduce the results in STATA. — stephematician, May 21 '23 at 05:29

Zhiqiang Wang · Accepted Answer · 2023-05-21T14:05:25.493

It is all about data setup for survival analysis. There are two issues:

The R survreg treats each observation as one independent sampling unit while your Stata stset time, id(leadid) failure(c_coup) specifies ID (leadid) with multiple observations. You need to set up a start-stop type Surv object.
survreg would throw a error message: "start-stop type Surv objects are not supported". Use flexsurv::flexsurvreg instead:

To get the same results as Stata in R, try this:

library(survival)
library(survminer)
library(tidyverse)
library(flexsurv)
data <- read.csv("leader_tvc_2.csv", encoding = "latin1")
df <- data %>% 
  mutate(time0 = lag(time, default=0), .by = leadid)
surv_object <- Surv(time0, time, c_coup) ~ legislature + leg_growth_2 + gdp_1k + chgdpen_fearonlaitin + Oil_fearonlaitin + postcoldwarlag + civiliandictatorshiplag + militarydictatorshiplag + communist + lpopl1_fearonlaitin + ethfrac_fearonlaitin + relfrac_fearonlaitin + age
fit <- flexsurvreg(surv_object, data=df, dist = "weibull") 

> broom::tidy(fit)
# A tibble: 15 × 5
   term                    estimate std.error statistic   p.value
   <chr>                      <dbl>     <dbl>     <dbl>     <dbl>
 1 shape                     1.09     0.0697     NA     NA       
 2 scale                    13.1      8.61       NA     NA       
 3 legislature               2.06     0.197      10.4    1.01e-25
 4 leg_growth_2              3.04     1.81        1.68   4.68e- 2
 5 gdp_1k                    0.0145   0.0382      0.380  3.52e- 1
 6 chgdpen_fearonlaitin      0.804    1.12        0.721  2.35e- 1
 7 Oil_fearonlaitin          0.126    0.250       0.505  3.07e- 1
 8 postcoldwarlag            0.517    0.329       1.57   5.84e- 2
 9 civiliandictatorshiplag  -1.28     0.325      -3.95   3.91e- 5
10 militarydictatorshiplag  -0.882    0.306      -2.88   1.96e- 3
11 communist                 0.480    0.282       1.71   4.40e- 2
12 lpopl1_fearonlaitin       0.0857   0.0664      1.29   9.87e- 2
13 ethfrac_fearonlaitin      0.419    0.298       1.41   7.97e- 2
14 relfrac_fearonlaitin      0.491    0.387       1.27   1.02e- 1
15 age                      -0.0156   0.00780    -2.00   2.27e- 2
>

How to reproduce Stata competing-risks survival analysis results in R?

1 Answers1