Stratified cluster sampling estimates from survey package

Question

I want to estimate means and totals from a stratified sampling design in which single stage cluster sampling was used in each stratum. I believe I have the design properly specified using the svydesign() function of the survey package. But I'm not sure how to correctly specify the stratum weights.

Example code is shown below. I provide unadjusted stratum weights using the weights= argument. I expected that the estimate and the SE from svytotal() would be equal to the sum of the stratum weights (70, in the example) times the estimate and SE from svymean(). Instead the estimates differ by a factor of 530 (which is the sum of the stratum weights over all of the elements in the counts data) and the SEs differ by a factor of 898 (???). My questions are (1) how can I provide my 3 stratum weights to svydesign() in a way that it understands, and (2) why aren't the estimates and SEs from svytotal() and svymean() differing by the same factor?

library(survey)

# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum=rep(c("A", "B", "C"), c(5, 8, 8)), 
  Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count = 1:21
)
# stratum weights
weights <- data.frame(
  Stratum=c("A", "B", "C"),
  W=c(10, 20, 40)
)

# combine counts and weights
both <- merge(counts, weights)

# estimate mean and total count
D <- svydesign(id=~Cluster, strata=~Stratum, weights=~W, data=both)
a <- svymean(~Count, D)
b <- svytotal(~Count, D)

sum(weights$W)  #  70
sum(both$W)     # 530
coef(b)/coef(a) # 530 
SE(b)/SE(a)     # 898.4308

First update

I'm adding a diagram to help explain my design. The entire population is a lake with known area (70 ha in this example). The strata have known areas, too (10, 20, and 40 ha). The number of clusters allocated to each stratum was not proportional. Also, the clusters are tiny relative to the number that could possibly be sampled, so the finite population correction is FPC = 1.

I want to calculate an overall mean and SE on a per unit area basis and a total that is equal to 70 times this mean and SE.

Second update

I wrote the code to do the calculations from scratch. I get a total estimate of 920 with se 61.6.

library(survey)
library(tidyverse)

# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum=rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count = c(5:1, 6:21)
)
# stratum weights
areas <- data.frame(
  Stratum=c("A", "B", "C"),
  A_h=c(10, 20, 40)
)

# calculate cluster means
step1 <- counts %>%
  group_by(Stratum, Cluster) %>%
  summarise(P_hi = sum(Count), m_hi=n())
step2 <- step1 %>%
  group_by(Stratum) %>%
  summarise(
    ybar_h = sum(P_hi) / sum(m_hi),
    n_h = n(),
    sh.numerator = sum((P_hi - ybar_h*m_hi)^2),
    mbar_h = mean(m_hi)
  ) %>%
  mutate(
    S_ybar_h = 1 / mbar_h * sqrt( sh.numerator / (n_h * (n_h-1)) )
  )

# now expand up to strata
step3 <- step2 %>%
  left_join(areas) %>%
  mutate(
    W_h = A_h / sum(A_h)
  ) %>%
  summarise(
    A = sum(A_h),
    ybar_strat = sum(W_h * ybar_h),
    S_ybar_strat = sum(W_h * S_ybar_h / sqrt(n_h))
  ) %>%
  mutate(
    tot = A * ybar_strat,
    S_tot = A * S_ybar_strat
  )

step2
step3

This gives the following output:

> step2
# A tibble: 3 x 6
  Stratum ybar_h   n_h sh.numerator   mbar_h S_ybar_h
   <fctr>  <dbl> <int>        <dbl>    <dbl>    <dbl>
1       A    3.0     2         18.0 2.500000 1.200000
2       B    9.5     3        112.5 2.666667 1.623798
3       C   17.5     3         94.5 2.666667 1.488235
> step3
# A tibble: 1 x 5
      A ybar_strat S_ybar_strat   tot   S_tot
  <dbl>      <dbl>        <dbl> <dbl>   <dbl>
1    70   13.14286    0.8800657   920 61.6046

score 2 · Answer 1 · answered Mar 25 '18 at 03:45

(Revised answer to revised question)

In this case svytotal isn't what you want -- it's for the actual population total of the elements being sampled, and so doesn't make sense when the population is thought of as infinitely bigger than the sample. The whole survey package is really designed for discrete, finite populations, but we can work around it.

I think you want to get a mean for each stratum and then multiply it by the stratum weights. To do that,

D <- svydesign(id=~Cluster, strata=~Stratum, data=both)
means<- svyby(~Count, ~Stratum, svymean, design=D)
svycontrast(means, quote(10*A+20*B+40*C))

You'll get a warning

Warning message:
In vcov.svyby(stat) : Only diagonal elements of vcov() available

That's because svyby doesn't return covariances between the stratum means. It's harmless, because the strata really are independent samples (that's what stratification means) so the covariances are zero.

This is helpful. I decided to walk through the calculations myself (see second update to my post). I get the same stratum means and the same total estimate, but a slightly different standard error (you 68.9, me 61.6). I'm guessing this has to do with some finite population correction that you employed and I did not. — Jean V. Adams, Mar 25 '18 at 13:45

score 1 · Answer 2 · answered Mar 25 '18 at 02:30

svytotal is doing what I think it should do here: weights are based on sampling probability, so they are only defined for sampling units. The svydesign call applied those weights to the clusters and (because cluster sampling) to the elements, giving the 530-fold higher total. You need to supply either observation weights or enough information for svydesign to calculate them itself. If this is cluster sampling with no subsampling, you can divide the stratum weight over the clusters to get the cluster weight and the divide this over elements within a cluster to get the observation weight. Or, if the stratum weight is the number of clusters in the population, you can use the fpc argument to svydesign

The fact that the SE doesn't scale the same way as the point estimate is because the population size is unknown and has to be estimated. The mean is the estimated total divided by the estimated population size, and the SE estimate takes account of the variance of the denominator and its covariance with the numerator.

I see what you're saying about the weights. I need to calculate the weight for each element as a proportion (I think that's what you mean by "observation weights"). I thought there might be a built-in way to do that in the **survey** package. Your explanation about the SEs is both revealing and puzzling. In this case, I know the up-scaling factor. It's the area of a lake. I've added some more description and a diagram to help explain the situation. — Jean V. Adams, Mar 25 '18 at 03:20

Stratified cluster sampling estimates from survey package

2 Answers2