For comparison purposes with the survey-based calculation of the proportion Confidence Intervals (CI), I here post the code to compute the CIs using the Central Limit Theorem (CLT) for non-identically distributed variables (as requested by the Original Poster (OP) in his comments):
1) Helper functions definition
#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
design.mean = svymean(~Category, design, deff="replace")
CI = confint( design.mean )
# Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )
return(as.data.frame(CI))
}
#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindenberg's version of the Central Limit Theorem (CLT) which
#' applies to independent but NOT identically distributed random variables
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
### The following calculations are based on the following setup for each category given in 'df':
### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
### where X is a 0/1 variable indicating absence or presence of a selected category.
### From the X(i) samples we would like to estimate the
### true proportion p of the presence of the category in the population.
### Therefore X(i) are iid random variables with Binomial(1,p) distribution
###
### 2) Let Y(i) = w(i)*X(i)
### where w(i) is the sampling weight applied to variable X(i).
###
### We apply the CLT to the sum of the Y(i)'s, using:
### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
###
### Hence, by CLT:
### Sum{Y(i) - mu(i)} / sigma -> N(0,1)
### where:
### sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
### and note that:
### Sum{ mu(i) } = Sum{ w(i) } * p = n*p
### since the sampling weights are assumed to sum up to the sample size.
###
### Note: all the Sums are from i = 1, ..., n
###
### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
### in the usual way, by first estimating sigma replacing p for the estimated p.
###
alpha = 1 - ci_level # area outside the confidence band
z = qnorm(1 - alpha/2) # critical z-quantile from Normal(0,1)
n = nrow(df) # Sample size (assuming no missing values)
ws = df$Weight / sum(df$Weight) * n # Weights scaled to sum the sample size (assumed for sampling weights)
S = aggregate(ws ~ Category, sum, data=df) # Weighted-base estimate of the total by category (Sum{ Y(i) })
sigma2 = sum( ws^2 ) # Sum of squared weights (note that we must NOT sum by category)
S[,"p"] = S[,"ws"] / n # Estimated proportion by category
S[,"Delta"] = z * sqrt( sigma2 ) *
sqrt( S$p * (1 - S$p) ) / n # Semi-size of the CI by category
LB_name = paste(formatC(alpha/2*100, format="g"), "%") # Name for the CI's Lower Bound column
UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
S[,LB_name] = S[,"p"] - S[,"Delta"] # CI's Lower Bound
S[,UB_name] = S[,"p"] + S[,"Delta"] # CI's Upper Bound
return(S)
}
#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
op = options(digits=digits)
print(values)
options(op)
}
2) Simulated Data
### Simulated data
set.seed(100)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
3) Survey-based CI (for reference)
# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)
which gives:
p Delta 2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628 0.116
CategoryB 0.2119 0.0417 0.1702 0.254
CategoryC 0.2824 0.0445 0.2379 0.327
CategoryD 0.4161 0.0497 0.3664 0.466
4) CLT-based CI
The description of the method used is included in the comments at the top of the ci_clt()
function defined above.
# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)
which gives:
Category ws p Delta 2.5 % 97.5 %
1 A 44.8 0.0896 0.0286 0.061 0.118
2 B 106.0 0.2119 0.0410 0.171 0.253
3 C 141.2 0.2824 0.0451 0.237 0.328
4 D 208.0 0.4161 0.0494 0.367 0.465
5) Comparison of CI sizes
Here we compute the ratio between the CLT-based CIs and the survey-based CIs.
# Comparison of CI size
show_values(
data.frame(Category=CI_clt[,"Category"],
ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)
which gives:
Category ratio_DeltaCI_clt2survey
1 A 1.067
2 B 0.982
3 C 1.013
4 D 0.994
Since all the ratios are close to 1, we conclude that the CI sizes from the two methods are very similar!
6) Check that the CLT-based implementation for the CI seems to be correct
An convenient check to perform on the CLT-based calculation of the CIs is to run the calculation on the Simple Random Sample (SRS) case and verify that the results coincide with those given by the svymean()
calculation under the SRS design.
# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1 # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )
which gives:
Category p Delta 2.5 % 97.5 %
1 A 0.1 0.0263 0.0737 0.126
2 B 0.2 0.0351 0.1649 0.235
3 C 0.3 0.0402 0.2598 0.340
4 D 0.4 0.0429 0.3571 0.443
which we compare to the survey-based calculation:
# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )
that gives:
p Delta 2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737 0.126
CategoryB 0.2 0.0351 0.1649 0.235
CategoryC 0.3 0.0402 0.2598 0.340
CategoryD 0.4 0.0430 0.3570 0.443
We see that the results coincide, suggesting that the implementation of the CLT-based calculation of the CI in the general case with unequal weights is correct.