4

I have very little knowledge of statistics, so forgive me, but I'm very confused by how the numpy function std works, and the documentation is unfortunately not clearing it up.

From what I understand it will compute the standard deviation of a distribution from the array, but when I set up a Gaussian with a standard deviation of 0.5 with the following code, numpy.std returns 0.2:

sigma = 0.5
mu = 1
x = np.linspace(0, 2, 100)
f = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp((-1 / 2) * ((x - mu) / sigma)**2)
plt.plot(x, f)
plt.show()
print(np.std(f))

This is the distribution:

enter image description here

I have no idea what I'm misunderstanding about how the function works. I thought maybe I would have to tell it the x-values associated with the y-values of the distribution but there's no argument for that in the function. Why is numpy.std not returning the actual standard deviation of my distribution?

Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
Petra
  • 195
  • 1
  • 1
  • 10
  • 1
    The standard deviation of the y values is different from the standard deviation of x weighted by y – Mad Physicist Aug 25 '20 at 16:23
  • 1
    It's not clear from your question: are you aware that the typical use of `np.std` would be to apply it to a random *sample* from some (probably unknown) distribution, rather than to the graph of the PDF of a known distribution? For example, you'd expect `np.std(np.random.normal(loc=1.0, scale=0.5, size=1000))` to give a value close to `0.5`. – Mark Dickinson Aug 25 '20 at 17:06

2 Answers2

4

I suspect that you understand perfectly well how the function works, but are misunderstanding the meaning of your data. Standard deviation is a measure of the spread of data about the mean value.

When you say std(f), you are computing the the spread of the y-values about their mean. Looking at the graph in the question, a vertical mean of ~0.5 and a standard deviation of ~0.2 are not far fetched. Notice that std(f) does not involve the x-values in any way.

What you are expecting to get is the standard deviation of the x-values, weighted by the y-values. This is essentially the idea behind a probability density function (PDF).

Let's go through the computation manually to understand the differences. The mean of the x-values is normally x.sum() / x.size. But that is only true the the weight of each value is 1. If you weigh each value by the corresponding f value, you can write

m = (x * f).sum() / f.sum()

Standard deviation is the root-mean-square about the mean. That means computing the average squared deviation from the mean, and taking the square root. We can compute the weighted mean of squared deviation in the exact same way we did before:

 s = np.sqrt(np.sum((x - m)**2 * f) / f.sum())

Notice that the value of s computed this way from your question is not 0.5, but rather 0.44. This is because your PDF is incomplete, and the missing tails add significantly to the spread.

Here is an example showing that the standard deviation converges to the expected value as you compute it for a larger sample of the PDF:

>>> def s(x, y):
...     m = (x * y).sum() / y.sum()
...     return np.sqrt(np.sum((x - m)**2 * y) / y.sum())

>>> sigma = 0.5

>>> x1 = np.linspace(-1, 1, 100)
>>> y1 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x1 / sigma)**2)
>>> s(x1, y1)
0.4418881290522094

>>> x2 = np.linspace(-2, 2, 100)
>>> y2 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x2 / sigma)**2)
>>> s(x2, y2)
0.49977093783005005

>>> x3 = np.linspace(-3, 3, 100)
>>> y3 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x3 / sigma)**2)
>>> s(x3, y3)
0.49999998748515206
Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
  • Thank you so much, that makes a lot of sense! Do you know then how I would use numpy.std() to find the standard deviation of that distribution? Or would I have to do it manually? – Petra Aug 25 '20 at 19:21
  • @Petra. You might be able to recast that formula in terms of a call to `std` and `mean` or so, but I wouldn't expect it to be worth your time. – Mad Physicist Aug 25 '20 at 20:10
  • Are you sure the denominator in the standard deviation is supposed to be the sum of f and not its size, given that the formula for standard deviation has the denominator N as in N sample points? – Petra Aug 31 '20 at 16:30
  • @Petra. Think of the standard deviation as the square root of an average. The fact that it is the average squared error is irrelevant for the moment. If you accept the reasoning behind the definition of weighted mean `m = (x * f).sum() / f.sum()`, then weighted standard deviation is defined in the same way, but with `(x - m)**2` in place of just `x`. Try to understand the definitions intuitively rather than just thinking of them as formulas when you first learn them. – Mad Physicist Aug 31 '20 at 16:38
  • I'm sorry- I understand the variance (and therefore the standard deviation) as the average distance from the mean (which is why as I understand it, it's weighted by the number of sample points). Is this incorrect? I'm just confused as to why the formula in your answer is weighted by the sum of the probabilities rather than the number of sample points (though I can see by the results that it's correct). – Petra Aug 31 '20 at 16:44
  • @Petra. Think of your curve applied to 1million samples. There are 100K samples in the 0.0 bin, 500k in the 0.5 bin, 800k in the 1.0 bin, etc. The numbers are a bit off, but you get the idea. Now when you take the average of the distances, you have to account for the total number of samples that they represent. The factor of `f` in the numerator is just that: 800k at 1.0, 500k at 0.5, etc. The denominator is of course the total number of samples: `f.sum()`, or ~1million – Mad Physicist Aug 31 '20 at 16:53
  • Sorry, it's the denominator I'm confused about. As far as I'm seeing it, f(x) are the probabilities associated with the sample point x. Therefore would the total number of samples not be the number of sample points x i.e. `np.size(x)`? I'm not sure I understand why the total number of samples is the sum of the probabilities. – Petra Aug 31 '20 at 16:58
  • @Petra. Your chart is a histogram, not the actual data. The total number of samples is the sum of all the bins: `f.sum()` (up to some scale factor of course) – Mad Physicist Aug 31 '20 at 17:00
  • I think I understand better, thank you. I was trying to apply this to a distribution with nonuniform bins, and forgot that I would have to divide by the integral of the probability distribution in that case rather than simply the sum. Thank you very much for your time! – Petra Aug 31 '20 at 17:15
-2

np.std is used to compute standard deviation. This can be computed as below steps

  1. First we need to compute mean of distribution
  2. Then find the summation of (x - x.mean)**2
  3. Then find the means of above summation (by divide with number of elements in distribution)
  4. Then find square root of this means (calculated in step 3).

Thus this function is calculating the standard deviation of distribution being passed to it.

Dhiraj Bansal
  • 417
  • 3
  • 8