8

I want to calculate the pooled (actually weighted) standard deviation for all the unique sites in my data frame.

The values for these sites come from single-species forest stands, and I want to pool the mean and the sd so that I can compare broadleaved stands with conifer stands.
This is the data frame (df) with values for the broadleaved stands:

keybl           n   mean    sd
Vest02DenmDesp  3   58.16   6.16
Vest02DenmDesp  5   54.45   7.85
Vest02DenmDesp  3   51.34   1.71
Vest02DenmDesp  3   59.57   5.11
Vest02DenmDesp  5   62.89   10.26
Vest02DenmDesp  3   77.33   2.14
Mato10GermDesp  4   41.89   12.6
Mato10GermDesp  4   11.92   1.8
Wawa07ChinDesp  18  0.097   0.004
Chen12ChinDesp  3   41.93   1.12
Hans11SwedDesp  2   1406.2  679.46
Hans11SwedDesp  2   1156.2  464.07
Hans11SwedDesp  2   4945.3  364.58

keybl is the code for the site. The formula for the pooled SD is:

s = sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2))

(Sorry I can't post pictures and did not find a link that would directly go to the formula)

Where 2 is the number of groups, which will change depending on the site. I know this formula is used for the t-test to compare two groups; in this case I'm not planning to compare the groups. My professor suggested I use this formula to get a weighted sd. I didn't find an R function that incorporates this formula in the way I need it, so I tried to build my own. I am, however, new to R and not very good at writing functions and loops, so I hope for your help.
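For concreteness, here is the two-group formula worked out by hand on the first two Vest02DenmDesp rows of the table above (just a minimal sketch of the arithmetic, using the n and sd values from the data frame):

```r
# Pooled SD for two groups, using the first two Vest02DenmDesp rows:
# row 1: n = 3, sd = 6.16;  row 2: n = 5, sd = 7.85
n1 <- 3; s1 <- 6.16
n2 <- 5; s2 <- 7.85
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
s_pooled  # about 7.33
```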

This is what I got so far:

sd = function(data) {
    nc1 = data[z, "nc"]
    sc1 = data[z, "sc"]
    nc2 = data[z + 1, "nc"]
    sc2 = data[z + 1, "sc"]
    sd1 = (nc1 - 1) * sc1^2 + (nc2 - 1) * sc2^2
    sd2 = sd1 / (nc1 + nc2 - length(nc1))
    sqrt(sd2)
}

splitdf=split(df, with(df, df$keybl), drop = TRUE)

for (c in 1:length(splitdf)) {
    for (i in 1:length(splitdf[[i]])) {
        a = splitdf[[i]]
        b = sd(a)
    }
}

1) The function itself is not correct, as it gives slightly lower values than it should, and I don't understand why. Could it be that it does not stop when z+1 reaches the last row? If so, how can that be corrected?

2) The loop is totally wrong, but it is what I could come up with after several hours without success.

Can anybody help me?

Thanks,

Antra

2 Answers

6

What you're trying to do would benefit from a more general formula, which will make it easier. If you didn't need to break the data into pieces by the keybl variable, you'd be done.

dd <- df # df is not a good name for a data.frame variable, since df already means degrees of freedom in statistics

dd$df <- dd$n-1
pooledSD <- sqrt( sum(dd$sd^2 * dd$df) / sum(dd$df) )
# note, in this case I only pre-calculated df because I'll need it more than once. The sum of squares, variance, etc. are only used once.

An important general principle in R is that you use vector math as much as possible. In this trivial case it won't matter much but in order to see how to do this on large data.frame objects where compute speed is more important read on.

# First use R's vector facilities to define the variables you need for pooling.
dd$df <- dd$n-1
dd$s2 <- dd$sd^2 # sd isn't a good name for a standard deviation variable, even in a data.frame: sd is already a function, and shadowing it is a bad habit
dd$ss <- dd$s2 * dd$df

And now just use convenience functions for splitting and calculating the necessary sums. Note that only one function is executed in each implicit loop (*apply, aggregate, etc. are all implicit loops that execute a function many times).

ds <- aggregate(ss ~ keybl, data = dd, sum)
ds$df <- tapply(dd$df, dd$keybl, sum) # two different built-in split-apply methods; we could use aggregate for both if we wanted
# divide your ss by your df and voila
ds$s2 <- ds$ss / ds$df
# and you can easily get your sd
ds$s <- sqrt(ds$s2)

And the correct answer is:

           keybl           ss df           s2          s
1 Chen12ChinDesp 2.508800e+00  2 1.254400e+00   1.120000
2 Hans11SwedDesp 8.099454e+05  3 2.699818e+05 519.597740
3 Mato10GermDesp 4.860000e+02  6 8.100000e+01   9.000000
4 Vest02DenmDesp 8.106832e+02 16 5.066770e+01   7.118125
5 Wawa07ChinDesp 2.720000e-04 17 1.600000e-05   0.004000

This looks much less concise than other methods (like 42-'s answer), but if you unroll those in terms of how many R commands are actually executed, this is much more concise. For a short problem like this, either way is fine, but I thought I'd show you the method that uses the most vector math. It also highlights why those convenient implicit-loop functions exist: expressiveness. If you used for loops to accomplish the same thing, the temptation would be stronger to put everything in the loop, which can be a bad idea in R.
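For comparison, the same split-and-pool computation can be sketched in a single implicit loop with by(); the data frame literal below just reproduces a few rows from the question for illustration.

```r
# Sketch: pooled SD per site with by(), using rows reproduced from the question
dd <- data.frame(
  keybl = c("Vest02DenmDesp", "Vest02DenmDesp", "Mato10GermDesp", "Mato10GermDesp"),
  n     = c(3, 5, 4, 4),
  sd    = c(6.16, 7.85, 12.6, 1.8)
)
pooled <- by(dd, dd$keybl, function(g)
  sqrt(sum(g$sd^2 * (g$n - 1)) / sum(g$n - 1)))
pooled  # Mato10GermDesp pools to 9; these two Vest02DenmDesp rows to about 7.33
```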

John
    Why not just do the `n-1` on the fly? `sqrt( sum(df$sd^2 * (df$n - 1)) / (sum(df$n - 1)) )` – atomicules Jan 09 '14 at 12:46
  • @John why are your pooled SD values different from 42's? – MAPK Jul 31 '19 at 08:29
  • Why does @42 have Inf for one of the answers? It's solvable, so clearly that method is incorrect. Here's a way to double check: if all of the N's are equal, then the pooled variance is just the average variance. The Hans11SwedDesp group has all equal N's; average their variances and see what you get. – John Aug 01 '19 at 04:29
3

The pooled SD under the assumption of independence (so the covariance terms can be assumed to be zero) will be: sqrt( sum_over_groups[ (n_i - 1)*s_i^2 ] / ( sum(n_i - 1) - N_groups ) )

    lapply( split(dat, dat$keybl),
            function(dd) sqrt( sum( dd$sd^2 * (dd$n-1) ) / (sum(dd$n-1) - nrow(dd)) ) )
#-------------------------
$Chen12ChinDesp
[1] 1.583919

$Hans11SwedDesp
[1] Inf

$Mato10GermDesp
[1] 11.0227

$Vest02DenmDesp
[1] 9.003795

$Wawa07ChinDesp
[1] 0.004123106
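(A compact variant mentioned in the comments: sapply(split()) returns a named numeric vector instead of a list. A sketch of the same computation, with the Vest02DenmDesp rows reproduced from the question:)

```r
# Same pooling via sapply(split()); Vest02DenmDesp rows from the question
dat <- data.frame(
  keybl = rep("Vest02DenmDesp", 6),
  n     = c(3, 5, 3, 3, 5, 3),
  sd    = c(6.16, 7.85, 1.71, 5.11, 10.26, 2.14)
)
pooledv <- sapply(split(dat, dat$keybl),
                  function(dd) sqrt(sum(dd$sd^2 * (dd$n - 1)) /
                                    (sum(dd$n - 1) - nrow(dd))))
pooledv  # named numeric vector; Vest02DenmDesp comes out around 9.0038
```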
IRTFM
  • `lapply+split` ~ `by`? ;-) – agstudy Jun 07 '13 at 02:46
  • Don't I get credit for recognizing that the cross-variance terms should be assumed to be zero? To your point: I sometimes avoid `by` because it would need `do.call(rbind(.))` to pull things back together and it looks clunky (3 functions instead of two), but that might be OK, as would `sapply(split())` here. A matter of style, I think. – IRTFM Jun 07 '13 at 03:56
  • 1
    Thanks a lot. This answer works well and is much better compared to what I had in mind. I only added (dd$n-1) to the function as the sd has to be multiplied by n-1. sum( dd$sd^2 * (dd$n-1) ) – Antra Jun 07 '13 at 21:54
  • It superficially looks better, but there's clearly an error: Wawa07ChinDesp should be the same as the original value, because there's nothing to pool. It seems to only be correct when df is always 2. An easy fix would be to just use dd$n-1 in the denominator sum (.../ sum(dd$n-1)). – John Aug 11 '17 at 11:57
  • Edited. Also looks to me that it might be erroneous anytime there are single-item groups. Not sure "variance" makes a lot of sense in those situations anyway. – IRTFM Aug 11 '17 at 16:52
  • @42- May I ask you why you have `sum(dd$n-1)` instead of just `sum(dd$n)`? I.e. in denominator. – MAPK Jul 31 '19 at 08:16
  • It's one of those things statisticians do to stay pure. – IRTFM Jul 31 '19 at 22:25