Why is the var() function giving me a different answer than my calculated variance?

Question

I wasn't sure if this should go in SO or some other .SE, so I will delete if this is deemed to be off-topic.

I have a vector and I'm trying to calculate the variance "by hand" (meaning based on the definition of variance but still performing the calculations in R) using the equation: V[X] = E[X^2] - E[X]^2 where E[X] = sum (x * f(x)) and E[X^2] = sum (x^2 * f(x))

However, my calculated variance is different from the var() function that R has (which I was using to check my work). Why is the var() function different? How is it calculating variance? I've checked my calculations several times so I'm fairly confident in the value I calculated. My code is provided below.

vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
range(vec)
counts <- hist(vec + .01, breaks = 7)$counts
fx <- counts / (sum(counts)) #the pmf f(x)
x <- c(min(vec): max(vec)) #the values of x
exp <- sum(x * fx) ; exp #expected value of x
exp.square <- sum(x^2 * fx) #expected value of x^2
var <- exp.square - (exp)^2 ; var #calculated variance
var(vec)

This gives me a calculated variance of 2.234 but the var() function says the variance is 2.383.

hint: `your_var*(n/(n-1)) = 2.234*(16/15) = 2.383 = var(data)`. — Ben Bolker, Feb 20 '15 at 20:55
Likely because you need to calculate the **[unbiased estimator](http://en.wikipedia.org/wiki/Bias_of_an_estimator#Sample_variance)** of the variance. — BrodieG, Feb 20 '15 at 20:55
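
The hint in the comment above can be checked directly. The following is a minimal sketch, not the asker's code: the names `pop_var` and `rescaled` are made up for illustration, and the divide-by-n variance is computed from plain averages rather than from the histogram counts.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
n <- length(vec)

# Divide-by-n variance: mean of the squares minus the squared mean
pop_var <- mean(vec^2) - mean(vec)^2
pop_var                          # 2.234375

# Rescaling by n / (n - 1) gives the divide-by-(n - 1) variance
rescaled <- pop_var * n / (n - 1)
all.equal(rescaled, var(vec))    # TRUE
```

`all.equal()` is used instead of `==` so that floating-point rounding does not affect the comparison.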

Sven Hohenstein · Accepted Answer · score 10 · 2016-11-02T19:55:49.943

While V[X] = E[X^2] - E[X]^2 is the population variance (when the values in the vector are the whole population, not just a sample), the var function calculates an estimator for the population variance (the sample variance).

edited Nov 02 '16 at 19:55

answered Feb 20 '15 at 20:55


Ahh I see. So if I'm asked to find the "estimated variance", does that imply population variance and not sample variance? – pocketlizard Feb 20 '15 at 21:00
It implies an estimate of the population variance; the population variance is a parameter, not something estimated. You should probably ask the person who asked you what they meant. [Wikipedia](http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance) has more, but I don't think its explanation is super-clear ... – Ben Bolker Feb 20 '15 at 21:03
@Sven Hohenstein `var` is "a **estimator** for the _population variance_", this means `var` is _sample variance_, because they use denominator n - 1, isn't it? – peterchen932 Jul 24 '15 at 02:31
@peterchen932 This is true. The `var` function calculates the sample variance. – Sven Hohenstein Jul 24 '15 at 04:59
According to `var`, "The denominator n - 1 is used", so the result should equal the calculation of the sample variance. – Waldir Leoncio Nov 02 '16 at 10:11
@WaldirLeoncio I changed the wording of my answer. – Sven Hohenstein Nov 02 '16 at 19:56
I was a bit peeved with the term "an estimator", which is a vague term because pretty much any statistic can be an estimator, but your rewording makes it much clearer, thanks for taking the time to edit. – Waldir Leoncio Nov 02 '16 at 20:00

wiwh · Answer 2 · 2016-05-22T11:21:55.763

While this has been answered already, I fear some may still be confused between population variance and its estimate from a sample, and this may be due to the example.

If the vector vec represents the full population, then vec is simply a way to represent the distribution function, which can be summarized more succinctly in the pmf that you derived from it. Crucially, the elements of vec in this case are not random variables. In this case, your computations of E[X] and var[X] from the pmf are correct.
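
As a rough sketch of the full-population case, the pmf can be read off with `table()` instead of the histogram workaround in the question. The names `pmf`, `x`, `fx`, `ex` and `ex2` are illustrative only, and the result is meaningful as V[X] only under the assumption that vec really is the whole population.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)

# Relative frequency of each distinct value: the pmf if vec is the whole population
pmf <- table(vec) / length(vec)
x  <- as.numeric(names(pmf))   # distinct values, in the same order as pmf
fx <- as.numeric(pmf)          # f(x) for each of those values

ex  <- sum(x * fx)             # E[X]
ex2 <- sum(x^2 * fx)           # E[X^2]
ex2 - ex^2                     # V[X] = E[X^2] - E[X]^2, here 2.234375
```

Values that never occur (such as 2 here) simply have f(x) = 0 and drop out of both sums.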

Most of the time, however, when you have data (for instance in the form of a vector) it is a random sample from the underlying population. Each element of the vector is the observed value of a random variable: it is a "draw" from the population. For this example, it is fair to assume that each element is drawn independently, from the same distribution ("iid"). In practice, this random sampling means that you cannot compute the true pmf, as you may have some variations due merely to chance. Likewise, you can't get the true value of E[X], E[X^2], and thus Var[X], from the sample. These values need to be estimated. The sample average is usually a good estimate for E[X] (in particular, it is unbiased), but it turns out that the sample variance is a biased estimate for the population variance. To correct for this bias, you need to multiply it by the factor n/(n-1).
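
A small simulation can make the bias visible. This is only a sketch under assumptions that are not part of the original question: the population is taken to be a fair six-sided die (true variance 35/12), samples of size 5 are drawn independently, and the seed and number of repetitions are arbitrary.

```r
set.seed(1)                # arbitrary seed, for reproducibility only

n    <- 5                  # sample size (assumed)
reps <- 20000              # number of simulated samples (assumed)

# Assumed population: a fair six-sided die, whose true variance is 35/12
true_var <- 35 / 12

# Divide-by-n variance of each simulated iid sample
div_by_n <- replicate(reps, {
  s <- sample(1:6, n, replace = TRUE)
  mean(s^2) - mean(s)^2
})

mean(div_by_n)                 # systematically below true_var, near true_var * (n - 1) / n
mean(div_by_n * n / (n - 1))   # close to true_var after the correction
```

Averaged over many samples, the uncorrected version settles around (n-1)/n times the true variance, which is exactly the shortfall that the factor n/(n-1) removes.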

As this latter case is the most common in practice (aside from textbook exercises), it is what is computed when you call the var() function in R. So if you're asked to find the "estimated variance", it most likely implies that your vector vec is a random sample and that you fall in this latter case. If this was the original question, then you have your answer, and I hope it is now clear that the variable names and comments in your code can lead to confusion: you cannot compute the pmf, the expected value or the variance of the population from a random sample. What you can get are their estimates, and one of them -- that of the variance -- is biased.

I wanted to clarify this because this confusion, visible in the code, is very common when first getting acquainted with these concepts. In particular, the accepted answer may be misleading: V[X] = E[X^2] - E[X]^2 is not the sample variance; it is the population variance, which you cannot get from a random sample. If you replace the values in this equation by their sample estimates (as averages), you get sample.V[X] = average[X^2] - average[X]^2, which is the sample variance, and is biased.

Some may say that I am being picky about semantics; however, the "abuse of notation" in the accepted answer is only acceptable when everybody recognizes it as such. For those trying to figure out these conceptual differences, I believe it is best to remain precise.

score 2 · Answer 3 · answered Nov 12 '17 at 19:25

Here's one way to calculate "estimated population variance" that matches the output of the var function in the stats package:

vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
n <- length(vec)
average <- mean(vec)
differences <- vec - average
squared.differences <- differences^2
sum.of.squared.differences <-  sum(squared.differences)
estimator <- 1/(n - 1)
estimated.variance <- estimator * sum.of.squared.differences
estimated.variance
[1] 2.383333
var(vec) == estimated.variance # The "hand calculated" variance equals the variance in the stats package.
[1] TRUE

I wonder what folks think about labelling the term "estimator."

In a function (that's unlikely to handle errors and anomalies as well as the var function in the stats package):

estimated.variance.by.hand <- function (x){
  n <- length(x)
  average <- mean(x)
  differences <- x - average
  squared.differences <- differences^2
  sum.of.squared.differences <-  sum(squared.differences)
  estimator <- 1/(n - 1)
  est.variance <- estimator * sum.of.squared.differences
  est.variance
}
estimated.variance.by.hand(vec)
estimated.variance.by.hand(1:10)
var(1:10)
estimated.variance.by.hand(1:100)
var(1:100)

SeGa · Answer 4 · 2018-06-23T19:25:12.213

Base R's var() uses N-1 in the denominator, which gives an unbiased estimator of the population variance. Unfortunately, there is no option to make var() use N instead, so I wrote my own variance function for that case.

var_N = function(x){var(x)*(length(x)-1)/length(x)}

And some code to illustrate the function above, the base function, the manual way and @dca's estimated.variance.by.hand() function:

## Data
x = c(4,5,6,7,8,2,4,6,6)
mean_x = mean(x)


## Variance with N-1 in denominator
var(x)
sum((x - mean_x) ^2) / (length(x) - 1)
estimated.variance.by.hand(x)


## Variance with N in denominator
sum((x - mean_x) ^2) / length(x)
var(x) * (length(x) - 1) / length(x)
var_N = function(x){var(x)*(length(x)-1)/length(x)}
var_N(x)

score 0 · Answer 5 · answered Jan 09 '23 at 11:26

The var() function calculates the sample variance. If you want the population variance, multiply it by (n-1)/n.

Assuming x1 is the vector:

# calculate length
n <- length(x1)

# calculate population variance
var(x1) * ((n - 1) / n)

answered Jan 09 '23 at 11:26

Snehal Bhartiya


What does this answer add to the existing answers ... ? – Ben Bolker Jan 09 '23 at 14:34
