
I have some code and do not understand why applying np.std delivers two different results.

import datetime

import numpy as np
import pandas as pd

a = np.array([1.5, 6.0, 7.0, 4.5])
print('mean value is:', a.mean())
print('standard deviation is:', np.std(a))

The next lines should do basically the same thing, just in a pandas DataFrame:

# four consecutive days starting 2000-01-01, all in the same month
base = datetime.datetime(2000, 1, 1)
arr = np.array([base + datetime.timedelta(days=i) for i in range(4)])
index_date = pd.Index(arr, name='dates')
data_gas = pd.DataFrame(a, index_date, columns=['value'], dtype=float)
# aggregate per month
mean_pandas = data_gas.resample('M').mean()
standard_deviation = data_gas.resample('M').apply(np.std)
print(mean_pandas)
print(standard_deviation)

From the documentation of np.std I can read: "...By default ddof is zero." (ddof=delta degrees of freedom).

np.std(a) delivers the standard deviation with divisor N (the number of values), whereas ...resample('M').apply(np.std) delivers the standard deviation with divisor N - 1. What causes this difference?

paulchen
  • Could you share the values you're getting in each case? – Tim B Mar 01 '17 at 15:37
  • np.std(a) results in 2.0767 and standard_deviation delivers 2.3979 – paulchen Mar 01 '17 at 15:46
  • So if I understand correctly, your question is "why does `.apply(np.std)` calculate using ddof=1, despite `np.std` itself using ddof=0?". Is that the correct interpretation? – Alex Riley Mar 01 '17 at 16:00

1 Answer


By default numpy uses the population standard deviation, which as you note has a divisor of N, where N is the number of values. This is used if you have a complete data set.

The pandas version is calculating the sample standard deviation. This has a divisor of N-1, and is used when you have a subset of data from a larger set. This can be achieved in numpy by np.std(a, ddof=1).
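To make the two divisors concrete, here is a minimal sketch using the array from the question. It assumes Python 3 and relies only on the documented defaults: np.std uses ddof=0, while pandas' Series.std uses ddof=1. The variable names are my own.

```python
import numpy as np
import pandas as pd

a = np.array([1.5, 6.0, 7.0, 4.5])

# NumPy default: population standard deviation (divisor N, ddof=0)
population_std = np.std(a)

# NumPy with ddof=1: sample standard deviation (divisor N - 1)
sample_std_numpy = np.std(a, ddof=1)

# pandas default: sample standard deviation (ddof=1)
sample_std_pandas = pd.Series(a).std()

# pandas with ddof=0 reproduces NumPy's default
population_std_pandas = pd.Series(a).std(ddof=0)

print('numpy  ddof=0:', population_std)
print('numpy  ddof=1:', sample_std_numpy)
print('pandas ddof=1:', sample_std_pandas)
print('pandas ddof=0:', population_std_pandas)
```

The ddof=0 results should match the 2.0767 reported in the comments, and the ddof=1 results the 2.3979.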

As an example, you would use sample standard deviation if you wanted to measure the standard deviation of shoe sizes in your city. It isn't feasible to measure everyone's size, so you use a sample of 100 shoe size measurements you took from people in the street. In this case you are using your (hopefully random) sample of data to model a larger set. In most cases I would say sample standard deviation is what you want.

If you didn't want to generalise your results to the whole city, but instead wanted to find the standard deviation of just this sample of 100 sizes, you would use population standard deviation.
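If you do want the population version in the monthly resample, one option is to call the resampler's own std method with an explicit ddof rather than going through apply(np.std). This is only a sketch under a few assumptions of mine: Python 3, pd.date_range to build the index, and the 'MS' (month start) frequency in place of the question's 'M', since the older alias is deprecated in recent pandas versions.

```python
import numpy as np
import pandas as pd

a = np.array([1.5, 6.0, 7.0, 4.5])

# Four consecutive days in January 2000, as in the question
dates = pd.date_range('2000-01-01', periods=4, name='dates')
data_gas = pd.DataFrame({'value': a}, index=dates)

# 'MS' groups by calendar month and labels each group by the month start
monthly = data_gas.resample('MS')

# Default ddof=1: sample standard deviation per month
monthly_sample_std = monthly.std()

# Explicit ddof=0: population standard deviation per month
monthly_population_std = monthly.std(ddof=0)

print(monthly_sample_std)
print(monthly_population_std)
```

Passing ddof explicitly keeps the choice of divisor visible in the code, so it does not depend on how a given pandas version dispatches a NumPy function handed to apply.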

Tim B
  • Thanks a lot. I think the two most important sentences of your answer are: Divisor `N` is used with a complete data set. `N-1` is used when you have a subset of data from a larger set (which is ultimately true in my case; I would like to calculate the sample standard deviation of monthly means of a lot of months...) Thanks again. – paulchen Mar 01 '17 at 16:02