I have some code and do not understand why the difference occurs:

np.std() defaults to ddof=0 when it is called on its own.

But when it is passed as an argument, as in pivot_table(aggfunc=np.std), why does it switch to ddof=1 automatically?

import numpy as np
import pandas as pd
dft = pd.DataFrame({'A': ['one', 'one'],
               'B': ['A', 'A'],
               'C': ['bar', 'bar'],
               'D': [-0.866740402,1.490732028]})



np.std(dft['D'])
# equivalent to np.std([-0.866740402, 1.490732028]) (default ddof=0)
# result: 1.178736215

dft.pivot_table(index=['A', 'B'], columns='C', aggfunc=np.std)
# equivalent to np.std([-0.866740402, 1.490732028], ddof=1)
# result: 1.666985
ALollz
SirenL
    My guess is that `np.std` [gets dispatched](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#dispatching-to-instance-methods) to groupby.GroupBy.std but since pandas and numpy use different default ddof that's the problem. I can never find how the dispatching is done in the source code though :/ – ALollz Mar 12 '20 at 04:12

1 Answer

pivot_table uses DataFrame.groupby.agg, and when you supply an aggregation function it tries to figure out exactly how to _aggregate.

arg=np.std will get handled here, the relevant code being

f = self._get_cython_func(arg)
if f and not args and not kwargs:
    return getattr(self, f)(), None

Hidden in the DataFrame class is this table:

pd.DataFrame()._cython_table
#OrderedDict([(<function sum>, 'sum'),
#             (<function max>, 'max'),
#             ...
#             (<function numpy.std>, 'std'),
#             (<function numpy.nancumsum>, 'cumsum')])

pd.DataFrame()._cython_table.get(np.std)
#'std'

And so np.std is only used to select the attribute to call. Its own default of ddof=0 is completely ignored, and the pandas default of ddof=1 is used instead.

getattr(dft['D'], 'std')()
#1.6669847417133286
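To make the two defaults concrete, here is a minimal sketch using the question's data. It passes the string 'std' rather than np.std, on the assumption that the string always resolves to the pandas method, whereas the handling of a NumPy callable may differ between pandas versions. The variable names (values, pop_std, sample_std, by_name) are only for illustration.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({'A': ['one', 'one'],
                    'B': ['A', 'A'],
                    'C': ['bar', 'bar'],
                    'D': [-0.866740402, 1.490732028]})

values = dft['D'].to_numpy()

# NumPy on a plain array: population standard deviation (ddof=0).
pop_std = np.std(values)

# pandas method: sample standard deviation (ddof=1) by default.
sample_std = dft['D'].std()

# Naming the aggregation as a string selects the pandas method,
# so the pivot result matches sample_std, not pop_std.
by_name = dft.pivot_table(index=['A', 'B'], columns='C',
                          values='D', aggfunc='std')

print(pop_std, sample_std, by_name.iloc[0, 0])
```

With only two observations the two results differ by a factor of sqrt(2), which is exactly the gap between 1.178736 and 1.666985 in the question.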
ALollz
  • Thank you very much. From a practical point of view, does this mean there is no way to get the population standard deviation (ddof=0) when using pivot_table, since ddof=1 is fixed for arg=np.std and users cannot change it? – SirenL Mar 12 '20 at 07:36
  • @SirenL you can either use `lambda x: np.std(x)` as your aggfunc, though that might get slow for larger data, or use a `groupby` + `unstack` to get the same reshaping as a pivot: `dft.groupby(['A', 'B', 'C'])['D'].std(ddof=0).unstack(-1)` – ALollz Mar 12 '20 at 14:13
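The two workarounds from the comment above can be sketched as follows, again on the question's data. This assumes a lambda is not matched against the lookup table and is therefore called as written. ddof=0 is spelled out explicitly for clarity, and the names pivot_pop and unstacked_pop are hypothetical.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({'A': ['one', 'one'],
                    'B': ['A', 'A'],
                    'C': ['bar', 'bar'],
                    'D': [-0.866740402, 1.490732028]})

# Option 1: wrap np.std in a lambda so it is applied to each group as written.
# Likely slower on large data, because it runs Python code per group.
pivot_pop = dft.pivot_table(index=['A', 'B'], columns='C', values='D',
                            aggfunc=lambda x: np.std(x, ddof=0))

# Option 2: group on all three keys, use the pandas std with an explicit
# ddof, then move the last key ('C') into the columns.
unstacked_pop = dft.groupby(['A', 'B', 'C'])['D'].std(ddof=0).unstack(-1)

print(pivot_pop)
print(unstacked_pop)
```

Both give the population standard deviation with the same (A, B) by C layout. The groupby version stays on the built-in aggregation path, which is why it is the one to prefer for larger frames.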