
TASK: COMPUTE THE WEIGHTED SHARES OF MANIFESTATIONS OF A CATEGORICAL VARIABLE AND FIND CONFIDENCE INTERVALS AROUND THOSE WEIGHTED SHARES

library(dplyr)
set.seed(100)

Make up a data set with a categorical variable and a weight variable:

df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

Have a quick look at the data:

head(df, 10)
tail(df, 10)

Now, I CAN complete the task without taking the weights into account:

Write a function that returns the UNWEIGHTED share of one manifestation of the categorical variable together with its 95% confidence interval (for general information on determining confidence intervals for a population proportion, see e.g.: https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/):
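(For reference: the interval computed below is the usual normal approximation, share ± 1.96 * sqrt(share * (1 - share) / n), which the function only applies when n * share * (1 - share) >= 9.)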

ci.share <- function(category, manifestation){
  n = length(category)
  share = length(which(category == manifestation)) / n
  se = sqrt(share*(1-share) / n)

  # use the normal approximation only if n*p*(1-p) is large enough (rule of thumb: >= 9)
  if(n*share*(1-share) >= 9){
    U <- share - 1.96*se
    O <- share + 1.96*se
    KI <- c(U, share, O)
    names(KI) <- c("lower boundary", "share", "upper boundary")
    KI = KI * 100  # express as percentages
    return(KI)
  } else {return("Error, because n*p*(1-p) < 9")}
}

Apply the function to all manifestations and store the results in a list:

cis <- list()
for(i in c("A", "B", "C", "D")){
  cis[[i]] <- ci.share(category = df$Category, manifestation = i)
}

Make the results easily readable:

cis <- t(sapply(cis, "["))
cis <- round(cis, digits = 2)
cis  #TASK DONE

THE QUESTION IS: HOW TO GET THE EQUIVALENT OF "cis" CONSIDERING THE WEIGHTS?

What I can do is find the weighted shares:

ws <- summarise_at(group_by(df, Category), vars(Weight), sum)  # funs() is deprecated; pass sum directly
ws[,2] <- (ws[,2]/sum(df$Weight)) * 100
names(ws) <- c("Category", "Weighted_Share")
sum(ws$Weighted_Share) # Should be 100, is 100
ws

But how to get the Confidence Intervals now?

Would be very grateful for a solution. Thanks in advance! Andi

Andi Lope
  • The formula for the CI of the population's proportion you are using comes from the Central Limit Theorem (CLT) applied to the sum of `n` _independent and identically distributed_ random variables as the sample size `n` tends to infinity. You could use the Lyapunov's version of CLT which applies to independent but _non identically distributed_ random variables (Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution and https://stats.stackexchange.com/questions/89254/clt-can-be-used-for-weighted-sum-of-different-bernoulli-variables) – mastropi Mar 12 '20 at 15:04
  • Assuming you want the CI for each proportion, the questions I would like to have answered in order to provide a procedure are: what model is behind the use of the weighted average to estimate those proportions? Are you using weights because you believe they give you an unbiased estimator of the proportions? – mastropi Mar 12 '20 at 15:06

2 Answers


For comparison with the survey-based calculation of the proportion confidence intervals (CI), here is the code to compute the CIs using the Central Limit Theorem (CLT) for non-identically distributed variables (as requested by the Original Poster (OP) in the comments):

1) Definition of helper functions

#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
  design.mean = svymean(~Category, design, deff="replace")
  CI = confint( design.mean )
  # Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
  CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )

  return(as.data.frame(CI))
}

#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindeberg's version of the Central Limit Theorem (CLT), which
#' applies to independent but NOT identically distributed random variables.
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
  ### The following calculations are based on the following setup for each category given in 'df':
  ### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
  ### where X is a 0/1 variable indicating absence or presence of a selected category.
  ### From the X(i) samples we would like to estimate the
  ### true proportion p of the presence of the category in the population.
  ### Therefore X(i) are iid random variables with Binomial(1,p) distribution
  ###
  ### 2) Let Y(i) = w(i)*X(i)
  ### where w(i) is the sampling weight applied to variable X(i).
  ###
  ### We apply the CLT to the sum of the Y(i)'s, using:
  ### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
  ###   - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) are iid)
  ###
  ### Hence, by CLT:
  ###   Sum{Y(i) - mu(i)} / sigma -> N(0,1)
  ### where:
  ###   sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
  ### and note that:
  ###   Sum{ mu(i) } = Sum{ w(i) } * p = n*p
  ### since the sampling weights are assumed to sum up to the sample size.
  ###
  ### Note: all the Sums are from i = 1, ..., n
  ###
  ### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
  ### in the usual way, by first estimating sigma, replacing p with its estimate.
  ###

  alpha = 1 - ci_level                                         # area outside the confidence band
  z = qnorm(1 - alpha/2)                                       # critical z-quantile from Normal(0,1)

  n = nrow(df)                                                 # Sample size (assuming no missing values)
  ws = df$Weight / sum(df$Weight) * n                          # Weights scaled to sum to the sample size (as assumed for sampling weights)
  S = aggregate(ws ~ Category, sum, data=df)                   # Weight-based estimate of the total by category (Sum{ Y(i) })
  sigma2 = sum( ws^2 )                                         # Sum of squared weights (note that we must NOT sum by category)
  S[,"p"] = S[,"ws"] / n                                       # Estimated proportion by category
  S[,"Delta"] =  z * sqrt( sigma2 ) *
                     sqrt( S$p * (1 - S$p) ) / n               # Semi-size of the CI by category
  LB_name = paste(formatC(alpha/2*100, format="g"), "%")       # Name for the CI's Lower Bound column
  UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
  S[,LB_name] = S[,"p"] - S[,"Delta"]                          # CI's Lower Bound
  S[,UB_name] = S[,"p"] + S[,"Delta"]                          # CI's Upper Bound

  return(S)
}

#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
  op = options(digits=digits)
  print(values)
  options(op)
}

2) Simulated Data

### Simulated data
set.seed(100)
df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

3) Survey-based CI (for reference)

# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)

which gives:

               p  Delta  2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628  0.116
CategoryB 0.2119 0.0417 0.1702  0.254
CategoryC 0.2824 0.0445 0.2379  0.327
CategoryD 0.4161 0.0497 0.3664  0.466

4) CLT-based CI

The description of the method used is included in the comments at the top of the ci_clt() function defined above.

# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)

which gives:

  Category    ws      p  Delta 2.5 % 97.5 %
1        A  44.8 0.0896 0.0286 0.061  0.118
2        B 106.0 0.2119 0.0410 0.171  0.253
3        C 141.2 0.2824 0.0451 0.237  0.328
4        D 208.0 0.4161 0.0494 0.367  0.465

5) Comparison of CI sizes

Here we compute the ratio between the CLT-based CIs and the survey-based CIs.

# Comparison of CI size
show_values(
  data.frame(Category=CI_clt[,"Category"],
             ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)

which gives:

  Category ratio_DeltaCI_clt2survey
1        A                    1.067
2        B                    0.982
3        C                    1.013
4        D                    0.994

Since all the ratios are close to 1, we conclude that the CI sizes from the two methods are very similar!

6) Check that the CLT-based implementation for the CI seems to be correct

A convenient check on the CLT-based calculation of the CIs is to run the calculation on the Simple Random Sample (SRS) case and verify that the results coincide with those given by the svymean() calculation under the SRS design.

# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1                               # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )

which gives:

  Category   p  Delta  2.5 % 97.5 %
1        A 0.1 0.0263 0.0737  0.126
2        B 0.2 0.0351 0.1649  0.235
3        C 0.3 0.0402 0.2598  0.340
4        D 0.4 0.0429 0.3571  0.443

which we compare to the survey-based calculation:

# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )

that gives:

            p  Delta  2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737  0.126
CategoryB 0.2 0.0351 0.1649  0.235
CategoryC 0.3 0.0402 0.2598  0.340
CategoryD 0.4 0.0430 0.3570  0.443

We see that the results coincide, suggesting that the implementation of the CLT-based calculation of the CI in the general case with unequal weights is correct.

mastropi
  • Thank you very much for your comprehensive answer. I'm super busy at the moment, but I'm going to study your contribution sooner or later. Looking through it, I wondered about the fact that the CI sizes from the two methods are very similar! Didn't you emphasize that the CIs you obtained using the CLT for non-identically distributed variables are much smaller than those provided by svymean()? – Andi Lope Mar 24 '20 at 14:21
  • You are welcome. Please see my last comment to your original post where I write: "_I realized **I had made a mistake in the calculation of the CIs using CLT**... Now the CIs are very close to each other! (so my last comment is invalid)._" Looking forward to your feedback when you'll have more time. – mastropi Mar 24 '20 at 15:39

I finally found a very easy solution to my problem. The package "survey" does just what I want:

set.seed(100)
library(survey)

df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

d <- svydesign(id = ~1, weights = ~Weight, data = df)
svymean(~Category, d)

The code returns the weighted proportions (the same results as in "ws") and the corresponding standard errors, which makes it easy to calculate the confidence intervals.
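
For example, the 95% confidence intervals can be obtained directly from the svymean() result with confint(), or built by hand from the reported standard errors (a minimal sketch using the design d defined above):

m <- svymean(~Category, d)
confint(m)                    # 95% CIs for the weighted proportions

# or, equivalently, assembled from the estimates and standard errors
ci <- cbind(share = coef(m),
            lower = coef(m) - 1.96 * SE(m),
            upper = coef(m) + 1.96 * SE(m)) * 100
round(ci, digits = 2)         # in percent, comparable to "cis" above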

Andi Lope
  • I don't think your solution is the correct one for your problem. A couple of reasons: (i) from the documentation, the `weights` in the `svydesign()` function are supposed to be "Formula or vector specifying sampling weights as an alternative to prob". Conceptually, sampling weights are the inverse of the selection probabilities (of each unit in the population) (see e.g. https://unstats.un.org/unsd/demographic/meetings/egm/Sampling_1203/docs/no_5.pdf), this is why the documentation says "as an alternative to prob", where "prob" are the selection probabilities. – mastropi Mar 17 '20 at 23:54
  • (ii) when you ask for the design effect in the `svymean()` function by running `svymean(~Category, d, deff=TRUE)` you get the warning message "In svymean.survey.design2(~Category, d, deff = TRUE) : Sample size greater than population size: are weights correctly scaled?" This warning shows up because the sum of the inverse of the weights should be equal to the sample size, in your case 500, precisely because the sum of the selection probabilities should be equal to the sample size. – mastropi Mar 17 '20 at 23:55
  • 1
    (iii) when you ask for the confidence interval by running `confint( svymean(~Category, d) )` you get confidence intervals whose left bounds go way below 0 (e.g. -0.23) and one right bound is above 1. This does not make sense for the confidence interval of proportions. – mastropi Mar 17 '20 at 23:56
  • I think you should approach your problem via the use of a Central Limit Theorem as suggested by my first comment to your post, not by survey designs. And to that end, the answer to my questions in my second comment to your post would help. Essentially, what is the meaning of your weights? – mastropi Mar 17 '20 at 23:56
  • Hi @mastropi. Thanks a lot for your detailed comments. I will try to break down the situation I am facing: In the first sampling step, people in a country get randomly chosen from lists of the registry office. In the next sampling step, everyone is additionally chosen who lives in the same household as a person who was chosen in the first step. That's how the sample comes into being. In this scenario, the selection probabilities can approximately be represented by the household size, so the sampling weights are the inverse of the household sizes. – Andi Lope Mar 19 '20 at 18:07
  • Of course, the real selection probabilities are a lot smaller than the inverse of the household size, but the ratio between the inverse household sizes of two observation units should be pretty much the same as the ratio of their actual selection probabilities. – Andi Lope Mar 19 '20 at 18:08
  • The data I got comprise the inverse household sizes as weights. And I am advised to take those weights into account when calculating population proportions and CIs. I do not know the actual selection probabilities. – Andi Lope Mar 19 '20 at 18:18
  • Ok, now I understand, thanks. Then your proposed solution should be fine. I also note that my comments above about the CIs of the proportion going out of range (e.g. negative lower bound) came by using your initial answer where the survey design was defined as `Category` as a cluster (i.e. using `id=~Category` which then you changed to the correct `id=~1`). – mastropi Mar 20 '20 at 11:16
  • Finally, I also computed the CIs using the CLT for non identically distributed variables, i.e. for `Y(i) = w(i)*X(i)`, where `w(i)` are the sampling weights and `X(i)` are the Binomially distributed variables for a given category, and I obtained confidence intervals that are **much smaller** than those provided by `svymean()` with the specified design. In fact, the CLT CIs are between 1.5 and 3.3 times smaller, and _they are closer to the survey-based CIs as the estimated proportion decreases_. This makes absolute sense, since the convergence of the CLT is faster as `p` is closer to 0.5. – mastropi Mar 20 '20 at 11:23
  • You are right, I initially mis-specified the survey design using `id=~Category`. You say you also computed the CIs using the CLT for non-identically distributed variables. Would you mind posting the code? That would be very kind. Thanks in advance. – Andi Lope Mar 20 '20 at 17:44
  • Sure. And thanks to your request I realized I had made a mistake in the calculation of the CIs using CLT... Now the CIs are very close to each other! (so my last comment is invalid). Will post the full answer soon. – mastropi Mar 21 '20 at 14:55