2

When calculating a difference in two proportions I've seen that you can use the svyglm function in the survey package (gaussian family) to fit the model. Is there a way to obtain wilson 95% CIs for difference in proportions for survey weighted data?

I tried the following:

svy_design <- svydesign(ids = ~block, strata = ~strata_study, weights = ~wt, data = data, nest = TRUE, fpc = NULL)

model <- svyglm(as.numeric(binary_outcome)~binary_exposure, design=svy_design)

I got the following result:

Call:
svyglm(formula = as.numeric(binary_outcome) ~ binary_exposure, 
    design = svy_design, data = data)

Survey design:
svydesign(ids = ~block, strata = ~strata_study, weights = ~wt, 
    data = data, nest = TRUE, fpc = NULL)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.014101   0.003891 260.614   <2e-16 ***
binary_exposure  0.059430   0.039972   1.487    0.139    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.01723948)

Number of Fisher Scoring iterations: 2

confint(model)

                      2.5 %    97.5 %
(Intercept)       1.0064246 1.0217782
binary_exposure  -0.0194294 0.1382897

For the chi2 test:

svychisq(~binary_outcome+binary_exposure,design = svy_design, statistic="F") 

The result as follows:

Pearson's X^2: Rao & Scott adjustment

data:  svychisq(~binary_outcome + binary_exposure, design = svy_design,     statistic = "F")
F = 8.5087, ndf = 1, ddf = 200, p-value = 0.003938

The pvalues and 95% CI don't seem to match up.

How do you generate wilson 95% CI for a difference in proportions for survey weighted data?

Phil
  • 7,287
  • 3
  • 36
  • 66
s.g
  • 126
  • 7
  • What if you make your glm formula `as.numeric(binary_outcome) ~ 0 + binary_exposure`? – Phil Jul 31 '23 at 22:10

1 Answers1

2

This is a lot simpler with unweighted samples, since binary data must have exactly a Bernoulli distribution. That's what makes Wilson intervals possible. Once you add weights and clustering, there are no guarantees. Also, if you've told R your outcome variable isn't binary, it is going to believe you.

You can do better using the method="likelihood" option of confint.svyglm. I'm going to use one of the built-in data sets

First, the model: both as Gaussian and as binary with identity link

> summary(svyglm(I(ell==0)~comp.imp, design=dclus2))

Call:
svyglm(formula = I(ell == 0) ~ comp.imp, design = dclus2)

Survey design:
update(dclus2, noel = ell == 0)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.01659    0.01758   0.943    0.351  
comp.impYes  0.05951    0.02768   2.150    0.038 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.05391936)

Number of Fisher Scoring iterations: 2
> summary(svyglm(I(ell==0)~comp.imp, design=dclus2,family=quasi(link="identity",variance="mu(1-mu)")))

Call:
svyglm(formula = I(ell == 0) ~ comp.imp, design = dclus2, family = quasi(link = "identity", 
    variance = "mu(1-mu)"))

Survey design:
dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.01659    0.01758   0.943    0.351  
comp.impYes  0.05951    0.02768   2.150    0.038 *

These are the same, because they are both saturated models and the fitted values are just the observed means.

Confidence intervals are the same with the (default) Wald method

> confint(svyglm(I(ell==0)~comp.imp, design=dclus2,family=quasi(link="identity",variance="mu(1-mu)")),method="Wald")
                   2.5 %     97.5 %
(Intercept) -0.019009223 0.05218458
comp.impYes  0.003483472 0.11553839
> confint(svyglm(I(ell==0)~comp.imp, design=dclus2),method="Wald")
                   2.5 %     97.5 %
(Intercept) -0.019009223 0.05218458
comp.impYes  0.003483472 0.11553839

but they are different with the likelihood method

> confint(svyglm(I(ell==0)~comp.imp, design=dclus2),method="likelihood")
                  2.5 %     97.5 %
(Intercept) 0.008321884 0.02485347
comp.impYes 0.046501049 0.07252081
attr(,"levels")
[1] 0.95
> confint(svyglm(I(ell==0)~comp.imp, design=dclus2,family=quasi(link="identity",variance="mu(1-mu)")),method="likelihood")
                   2.5 %     97.5 %
(Intercept)  0.002364424 0.07965595
comp.impYes -0.006835856 0.11875773

In particular, the likelihood intervals for the 'quasi' model are not symmetric.

None of these agrees exactly with svychisq, which is based on a loglinear model rather than linear regression or even logistic regression, and which uses different approximations. They all agree there is little evidence for a difference, though.

Thomas Lumley
  • 1,893
  • 5
  • 8
  • Thanks. I tried the likelihood method using binary with identity link and got the following error messgage: Error: no valid set of coefficients has been found: please supply starting values. I set starting values as follows: confint(svyglm(I(binary_outcome==1)~binary_exposure, design=svy_design,family=quasi(link="identity",variance="mu(1-mu)"), start=c(0.01,0.06)), method="likelihood"). I think the issue is with confint, as the svyglm model seems to run ok with supplied starting values. Why does this happen? Thanks – s.g Aug 04 '23 at 18:45
  • Also, would the asymmetric intervals be better in this case so as to not assume normality of the distribution of the parameter, i.e. risk difference? Thanks – s.g Aug 04 '23 at 19:16