0

I have a data set with mean values, standard deviations and n. One of the variables has an equal sample size, while the sample size for the other one varies.

dat <- data.frame(variable = c(rep("x", 2), rep("y", 3)), replicate = c(1,2,1,2,3),
mean = c(3.4, 2.5, 6.5, 5.7, 5.1), sd = c(1.2, 0.7, 2.4, 4.0, 3.5),
n = c(3,3,5,4,6))

I need to combine x and y variables and am trying to find a code-sparing way to calculate combined standard deviation for instance using by aggregate function. The equation for combined standard deviation is following:

enter image description here

And for unequal sample sizes (same source):

enter image description here

My combined data frame should look like this:

variable    mean    sd
x           2.95    sd_x
y           5.76    sd_y

How to make a function in R that calculates the combined standard deviation? Or alternatively, if there is a package designed for this, it counts as an answer too =)

Mikko
  • 7,530
  • 8
  • 55
  • 92
  • 2
    It really sounds like you're asking people to write a function _for you_ that simply implements that formula. – joran Nov 27 '12 at 21:36
  • @joran I am sorry, if it seems so. That was not my meaning really. I struggled with this entire evening and thought of asking, since there just wasn't R solution for this. I thought that it would benefit others, if I asked. I made a longer question first where I explained what I had done, but deleted it, because it was too long and difficult to understand. – Mikko Nov 28 '12 at 08:11
  • I guess that there are several ways of doing this. Method presented here (http://arxiv.org/ftp/arxiv/papers/1007/1007.1012.pdf) gives the same values than flodel's answer here (http://stackoverflow.com/questions/9222056/existing-function-to-combine-standard-deviations-in-r?rq=1). The method I was looking at (above) gives slightly different values. I don't know why. Because of lack of knowledge, I'll settle with Rudmin's solution. – Mikko Nov 28 '12 at 09:57

2 Answers2

1

Rudmin (2010) states that exact variance of pooled data set is the mean of the variances plus the variance of the means. flodel has already provided an answer and function that gives similar values to Rudmin's statement. Using Rudmin's data set and flodel's function based on Wikipedia:

df <- data.frame(mean = c(30.66667, 31.14286, 40.33333), variance = c(8.555555, 13.26531, 1.555555), n = c(6,7,3))

grand.sd   <- function(S, M, N) {sqrt(weighted.mean(S^2 + M^2, N) -
                                      weighted.mean(M, N)^2)}

grand.sd(sqrt(df$variance), df$mean, df$n)^2 

#[1] 22.83983 = Dp variance in Rudmin (2010). 

However this solution gives slightly different values compared to the function 5.38 from Headrick (2010) (unless there is a mistake somewhere):

dat <- data.frame(variable = c(rep("x", 2), rep("y", 3)), replicate = c(1,2,1,2,3),
mean = c(3.4, 2.5, 6.5, 5.7, 5.1), sd = c(1.2, 0.7, 2.4, 4.0, 3.5),
n = c(3,3,5,4,6))

x <- subset(dat, variable == "x")

((x$n[1]^2)*(x$sd[1]^2)+
(x$n[2]^2)*(x$sd[2]^2)-
(x$n[2])*(x$sd[1]^2) -
(x$n[2])*(x$sd[2]^2) -
(x$n[1])*(x$sd[1]^2) -
(x$n[1])*(x$sd[2]^2) +
(x$n[1])*(x$n[2])*(x$sd[1]^2) +
(x$n[1])*(x$n[2])*(x$sd[2]^2) +
(x$n[1])*(x$n[2])*(x$mean[1] - x$mean[2])^2)/
((x$n[1] + x$n[2] - 1)*(x$n[1] + x$n[2]))

#[1] 1.015

grand.sd(x$sd, x$mean, x$n)^2

#[1] 1.1675

To answer my own question, the desired data.frame would be acquired followingly:

library(plyr)
ddply(dat, c("variable"), function(dat) c(mean=with(dat,weighted.mean(mean, n)),  sd = with(dat, grand.sd(sd, mean, n))))   

  variable     mean       sd
1        x 2.950000 1.080509
2        y 5.726667 3.382793
Community
  • 1
  • 1
Mikko
  • 7,530
  • 8
  • 55
  • 92
1

Use the sample.decomp function in the utilities package

Statistical problems of this kind are automated in the sample.decomp function in the utilities package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and pooled moments. It works for decompositions up to fourth order ---i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.


How to use the function: Here we show how to implement the function for your dataset.

#Input sample statistics for subgroups
SIZE <- c(3, 3, 5, 4, 6)
MEAN <- c(3.4, 2.5, 6.5, 5.7, 5.1)
SD   <- c(1.2, 0.7, 2.4, 4.0, 3.5)

#Compute sample decomposition
library(utilities)
sample.decomp(n = SIZE, sample.mean = MEAN, sample.sd = SD, include.sd = TRUE)

            n sample.mean sample.sd sample.var
1           3    3.400000  1.200000   1.440000
2           3    2.500000  0.700000   0.490000
3           5    6.500000  2.400000   5.760000
4           4    5.700000  4.000000  16.000000
5           6    5.100000  3.500000  12.250000
--pooled-- 21    4.933333  2.964428   8.787833

This output gives you the pooled sample size, sample mean and sample standard deviation (or equivalently, sample variance).

Ben
  • 1,051
  • 8
  • 26