Using the sd command in R with a binomially distributed variable

Question

I want to know whether the sd command in R works accurately when calculating the standard deviation of a binomial distribution.

Take the following example:

coin <- c("heads", "heads", "tails", "heads", "tails", "heads", "heads", "tails")
die <- as.factor(coin)

The standard deviation formula for such a distribution would be:

sd <- sqrt(n*p*(1-p))

where n is the number of trials, and p is the probability of success.

So we would calculate it in R as follows:

sqrt(8*(5/8)*(3/8))
[1] 1.369306

However, when we use the sd command, we get a different answer:

sd(coin)
[1] 0.5175492

Does the sd function in R not take into consideration the fact that the variable is not numeric. That explanation would make sense to me if R returned an error message, but it produces a result. Can you please clarify this for me? Thanks.

I'm not sure how were planning to calculate the `.sd` on values such as "heads" and "tails", and I guess R not sure either. So it just calculated the `sd` of the integer storage mode of your factor. Check `as.integer(die)` and then `sd(as.integer(die))` — David Arenburg, Jan 21 '16 at 21:09
The formula for the standard deviation applies to the underlying sampling distribution not to a particular sample. There is no way that R can look at `coin` and know that it is from a binomial distribution. As far as how R computed that standard deviation - it represented the factor as 1's and 0's (in this case it looks like 1 = heads). — Matthew, Jan 21 '16 at 21:10
You could simulate this using something like `sd(rbinom(1e6, 8, prob = 5/8))` — David Arenburg, Jan 21 '16 at 21:17
This Q&A started by another Frank may help: http://stats.stackexchange.com/q/29641 — Frank, Jan 21 '16 at 21:22

IRTFM · Answer 1 · 2016-01-21T21:55:50.763

The sd function returns the (corrected) sample standard deviation (not the theoretic SD of a Bernoulli random variable). The sample SD is defined as sqrt( sum((x-x_bar)^2)/(N-1)). See ?sd and ?var Your example can be checked:

samp_var_die <- sum((as.numeric(die)-mean(as.numeric(die)))^2)/(length(die)-1)

samp_sd_die <- sqrt(samp_var_die)
samp_sd_die
#[1] 0.5175492

If you are interested in exploring the theoretic aspects of statistical distributions, there is a nice suite of packages devoted to this topic. Check out the distr-package and it's cousins: distrEllipse, distrEx, distrMod, distrRmetrics, distrSim, distrTeach, and RandVar. I found playing with functions and examples from those packages quite educational and entertaining.

By the way, that 1.3+ value you got would be the SD (theoretic sigma) around the estimate of 5 you would have gotten from that series of observations.

Really? Isn't the mean-corrected SD going to be agnostic about that difference? (Also noting that `as.numeric` needed to be used because `mean` and `sum` are both (appropriately) fussier than `sd` about accepting a factor argument. — IRTFM, Jan 21 '16 at 21:44
Sorry, yes you're right of course. The representation doesn't matter with the mean correction. — Gregor Thomas, Jan 21 '16 at 21:49

Using the sd command in R with a binomially distributed variable

1 Answers1