9

I have 4 populations with known means and standard deviations. I would like to know the grand mean and grand sd. The grand mean is obviously simple to calculate, but R has a handy utility function, weighted.mean(). Does a similar function exist for combining standard deviations?

The calculation is not complicated, but an existing function would make my code cleaner and easier to understand.

Bonus question, what tools do you use to search for functions like this? I know it must be out there, but I've done a lot of searching and can't find it. Thanks!

Sam Swift
  • 913
  • 1
  • 11
  • 17

3 Answers3

7

I don't know of a specific package or function name but it seems easy to roll your own function from Wikipedia's page. Assuming no overlap in the populations:

## N: vector of sizes
## M: vector of means
## S: vector of standard deviations

grand.mean <- function(M, N) {weighted.mean(M, N)}
grand.sd   <- function(S, M, N) {sqrt(weighted.mean(S^2 + M^2, N) -
                                      weighted.mean(M, N)^2)}
flodel
  • 87,577
  • 21
  • 185
  • 223
  • Thanks very much for this answer flodel. When looking at the formula on wikipedia, I didn't think I could make the calculation look as simple as you did. I may in fact just use this, but AndresT's response is a bit more comprehensive for others finding this question. Thanks! – Sam Swift Feb 10 '12 at 15:30
6

Are the populations non overlapping?

library(fishmethods)
combinevar

For instance the example in wikipedia would work like this:

xbar <- c(70,65)
s<-c(3,2)
n <- c(1,1)
combinevar(xbar,s,n)

and standard deviation would be sqrt(combinevar(xbar,s,n)[2])

if you don't want to download the library the function goes like this:

combinevar <- 
function (xbar = NULL, s_squared = NULL, n = NULL) 
{
    if (length(xbar) != length(s_squared) | length(xbar) != length(n) | 
        length(s_squared) != length(n)) 
        stop("Vector lengths are different.")
    sum_of_squares <- sum((n - 1) * s_squared + n * xbar^2)
    grand_mean <- sum(n * xbar)/sum(n)
    combined_var <- (sum_of_squares - sum(n) * grand_mean^2)/(sum(n) - 
        1)
    return(c(grand_mean, combined_var))
}
aatrujillob
  • 4,738
  • 3
  • 19
  • 32
  • I'll be glad if someone refutes this, but fishmethods::combinevar (version 1.11-3) doesn't seem to be reliable. For example, when there is only 1 SD value, its result can be NaN; when there are two identical ones, its result can be 0.0 - not what I'd expect. The simple solution from flodel seems to work. – jciloa Oct 02 '22 at 21:29
2

Use the sample.decomp function in the utilities package

Statistical problems of this kind have now been automated in the sample.decomp function in the utilities package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and pooled moments. It works for decompositions up to fourth order ---i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.


How to use the function: Here we give an example where we use the function to compute the sample moments of a pooled sample composed of four subgroups. To do this, we first generate a mock dataset DATA containing four subgroups with unequal sizes, and we pool these as the single dataset POOL. The moments of the subgroups and the pooled sample can be obtained using the moments function in the same package.

#Create some subgroups of mock data and a pooled dataset
set.seed(1)
N    <- c(28, 44, 51, 102)
SUB1 <- rnorm(N[1])
SUB2 <- rnorm(N[2])
SUB3 <- rnorm(N[3])
SUB4 <- rnorm(N[4])
DATA <- list(SUB1 = SUB1, SUB2 = SUB2, SUB3 = SUB3, SUB4 = SUB4)
POOL <- c(SUB1, SUB2, SUB3, SUB4)

#Show sample statistics for the subgroups
library(utilities)
moments(DATA)

       n sample.mean sample.var sample.skew sample.kurt NAs
SUB1  28  0.09049834  0.9013829  -0.7648008    3.174128   0
SUB2  44  0.18637936  0.8246700   0.3653918    3.112901   0
SUB3  51  0.05986594  0.6856030   0.3076281    2.306243   0
SUB4 102 -0.05135660  1.0526184   0.3348429    2.741974   0

#Show sample statistics for the pooled sample
moments(POOL)

       n sample.mean sample.var sample.skew sample.kurt NAs
POOL 225  0.03799749  0.9030244   0.1705622    2.828833   0

Now that we have set of moments for subgroups, we can use the sample.decomp function to obtain the pooled sample moments from the subgroup sample moments. As an input to this function you can either use the moments output for the subgroups or you can input the sample sizes and sample moments separately as vectors (here we will do the latter). As you can see, this gives the same sample moments for the pooled sample as direct computation from the underlying data.

#Compute sample statistics for subgroups
library(utilities)
MEAN   <- c(mean(SUB1), mean(SUB2), mean(SUB3), mean(SUB4))
VAR    <- c( var(SUB1),  var(SUB2),  var(SUB3),  var(SUB4))

#Compute sample decomposition
sample.decomp(n = N, sample.mean = MEAN, sample.var  = VAR, names = names(DATA))

             n sample.mean sample.var
SUB1        28  0.09049834  0.9013829
SUB2        44  0.18637936  0.8246700
SUB3        51  0.05986594  0.6856030
SUB4       102 -0.05135660  1.0526184
--pooled-- 225  0.03799749  0.9030244

As you can see, the sample.decomp function allows computation of the pooled sample variance. You can read about this function in the package documentation.

Ben
  • 1,051
  • 8
  • 26