
I would like to perform an F-test on the equality of variances between a sample conducted with simple random sampling and one which incorporates weighting and stratification into the survey design. Several days ago, I asked a similar question about t-testing, which received an excellent answer and can be referenced here:

R- How to conduct two-sample t-test with two different survey designs

In the example below, I have defined two survey designs: one uses simple random sampling, and the other includes weighting and stratification. With the svyvar and degf functions, I can access the variance and the degrees of freedom. However, I am not sure the resulting p-value is correct. The results of this sample code are somewhat difficult to judge given the tiny sample size, but when I use this code with my full dataset, the p-values are not what I would expect.

I have a feeling my problem might have to do with the way the pf function defines "degrees of freedom." However, I have so far found contradictory definitions of df1 and df2: some sources seem to say that these simply refer to the df of the two samples, while others mention the df "between groups" and "within groups", which I am not certain how to calculate from the quantities retrievable from the survey package.
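
For reference, my understanding is that in the ordinary two-sample (iid) case, df1 and df2 are simply n1 - 1 and n2 - 1 -- the numerator and denominator degrees of freedom that var.test uses -- while the "between groups"/"within groups" language belongs to the ANOVA F-test, which is a different comparison. A quick sanity check on made-up iid data (not the survey case):

x <- rnorm(10)
y <- rnorm(8)
F_stat <- var(x) / var(y)
# numerator df = n1 - 1, denominator df = n2 - 1
pf(F_stat, df1 = length(x) - 1, df2 = length(y) - 1, lower.tail = FALSE)
# matches var.test's one-sided p-value
var.test(x, y, alternative = "greater")$p.value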

Am I totally off base with this formula/result?

library(survey)

wel <- c(68008.19, 128504.61,  21347.69,
         33272.95,  61828.96,  32764.44,
         92545.62,  58431.89,  95596.82,
         117734.27)
rmul <- c(16, 16, 16, 16, 16, 16, 16,
          20, 20, 20)
strat <- c(101, 101, 101, 101, 101, 102, 102, 102, 102, 102)

survey.data <- data.frame(wel, rmul, strat)

survey_unweighted <- svydesign(data = survey.data,
                               variables = ~wel,
                               ids = ~1)

survey_strat <- svydesign(data = survey.data,
                          variables = ~wel,
                          ids = ~1,
                          weights = ~rmul,
                          strata = ~strat,
                          nest = TRUE)

var1 <- coef(svyvar(~wel, survey_unweighted))
var2 <- coef(svyvar(~wel, survey_strat))
df_1 <- degf(survey_unweighted)
df_2 <- degf(survey_strat)

p_value <- pf((var1/var2), df_1, df_2, lower.tail = FALSE)
Nate
  • If you are not convinced about the p-values in your full dataset, perhaps you could use bootstrapping (https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29) on that full dataset to obtain the distribution of `(var1/df1) / (var2/df2)` and compute the p-value of the `(var1/df1) / (var2/df2)` ratio you get with your full dataset. – mastropi Jan 15 '22 at 20:17
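
A rough sketch of that bootstrap idea on the toy data above; resampling rows within strata (rather than a proper design-based replicate-weight bootstrap, e.g. via as.svrepdesign) is a simplification for illustration only:

set.seed(1)
boot_ratio <- replicate(1000, {
  # resample rows with replacement within each stratum, keeping stratum sizes
  idx <- unlist(lapply(split(seq_len(nrow(survey.data)), survey.data$strat),
                       function(i) sample(i, replace = TRUE)))
  d  <- survey.data[idx, ]
  d1 <- svydesign(data = d, variables = ~wel, ids = ~1)
  d2 <- svydesign(data = d, variables = ~wel, ids = ~1,
                  weights = ~rmul, strata = ~strat, nest = TRUE)
  coef(svyvar(~wel, d1)) / coef(svyvar(~wel, d2))
})
var1 / var2                              # observed ratio
quantile(boot_ratio, c(0.025, 0.975))    # does this interval contain 1?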

1 Answer


I don't think this is going to work. The design df you get from degf are intended to describe the uncertainty in the standard error of the mean, and related standard error estimates, not the uncertainty in the estimated population variance.

The (squared) standard error of the mean is genuinely a sum of squared terms, one per PSU, so there is a genuine sense in which its degrees of freedom can be described as the number of PSUs minus the number of strata. degf uses that formula and generalises it to other settings. There is also a literature on better degrees-of-freedom estimates, in the sense that they give better-calibrated confidence intervals.
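
On the toy designs above, for example, that formula gives:

degf(survey_unweighted)  # 10 PSUs - 1 stratum = 9
degf(survey_strat)       # 10 PSUs - 2 strata  = 8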

The estimated population variance isn't like that. It's more helpful to think of it as a mean, the mean of (X-E[X])^2. I don't know what is known about its distribution, but I wouldn't expect it to be close to a scaled chi^2 with degf(design) degrees of freedom.
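
For what it's worth, svyvar does attach a design-based standard error to the variance estimate (extractable with SE or vcov), and that, rather than an F reference distribution with degf degrees of freedom, is the uncertainty measure it actually provides:

v1 <- svyvar(~wel, survey_unweighted)
v2 <- svyvar(~wel, survey_strat)
coef(v1); SE(v1)   # estimated population variance and its standard error
coef(v2); SE(v2)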

Thomas Lumley