-1

I want to know whether the sd command in R works accurately when calculating the standard deviation of a binomial distribution.

Take the following example:

coin <- c("heads", "heads", "tails", "heads", "tails", "heads", "heads", "tails")
die <- as.factor(coin)

The standard deviation formula for such a distribution would be:

sd <- sqrt(n*p*(1-p))

where n is the number of trials, and p is the probability of success.

So we would calculate it in R as follows:

sqrt(8*(5/8)*(3/8))
[1] 1.369306

However, when we use the sd command, we get a different answer:

sd(coin)
[1] 0.5175492

Does the sd function in R not take into consideration the fact that the variable is not numeric. That explanation would make sense to me if R returned an error message, but it produces a result. Can you please clarify this for me? Thanks.

neutral
  • 107
  • 4
  • 13
  • 3
    I'm not sure how were planning to calculate the `.sd` on values such as "heads" and "tails", and I guess R not sure either. So it just calculated the `sd` of the integer storage mode of your factor. Check `as.integer(die)` and then `sd(as.integer(die))` – David Arenburg Jan 21 '16 at 21:09
  • 2
    The formula for the standard deviation applies to the underlying sampling distribution not to a particular sample. There is no way that R can look at `coin` and know that it is from a binomial distribution. As far as how R computed that standard deviation - it represented the factor as 1's and 0's (in this case it looks like 1 = heads). – Matthew Jan 21 '16 at 21:10
  • 1
    You could simulate this using something like `sd(rbinom(1e6, 8, prob = 5/8))` – David Arenburg Jan 21 '16 at 21:17
  • This Q&A started by another Frank may help: http://stats.stackexchange.com/q/29641 – Frank Jan 21 '16 at 21:22

1 Answers1

3

The sd function returns the (corrected) sample standard deviation (not the theoretic SD of a Bernoulli random variable). The sample SD is defined as sqrt( sum((x-x_bar)^2)/(N-1)). See ?sd and ?var Your example can be checked:

samp_var_die <- sum((as.numeric(die)-mean(as.numeric(die)))^2)/(length(die)-1)

samp_sd_die <- sqrt(samp_var_die)
samp_sd_die
#[1] 0.5175492

If you are interested in exploring the theoretic aspects of statistical distributions, there is a nice suite of packages devoted to this topic. Check out the distr-package and it's cousins: distrEllipse, distrEx, distrMod, distrRmetrics, distrSim, distrTeach, and RandVar. I found playing with functions and examples from those packages quite educational and entertaining.

By the way, that 1.3+ value you got would be the SD (theoretic sigma) around the estimate of 5 you would have gotten from that series of observations.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Really? Isn't the mean-corrected SD going to be agnostic about that difference? (Also noting that `as.numeric` needed to be used because `mean` and `sum` are both (appropriately) fussier than `sd` about accepting a factor argument. – IRTFM Jan 21 '16 at 21:44
  • Sorry, yes you're right of course. The representation doesn't matter with the mean correction. – Gregor Thomas Jan 21 '16 at 21:49