7

I was wondering how to calculate skewness and kurtosis correctly in pandas. Pandas gives some values for skew() and kurtosis() values but they seem much different from scipy.stats values. Which one to trust pandas or scipy.stats?

Here is my code:

import numpy as np
import scipy.stats as stats
import pandas as pd

np.random.seed(100)
x = np.random.normal(size=(20))

kurtosis_scipy = stats.kurtosis(x)
kurtosis_pandas = pd.DataFrame(x).kurtosis()[0]

print(kurtosis_scipy, kurtosis_pandas)
# -0.5270409758168872
# -0.31467107631025604

skew_scipy = stats.skew(x)
skew_pandas = pd.DataFrame(x).skew()[0]

print(skew_scipy, skew_pandas)
# -0.41070929017558555
# -0.44478877631598901

Versions:

print(np.__version__, pd.__version__, scipy.__version__)
1.11.0 0.20.0 0.19.0
BhishanPoudel
  • 15,974
  • 21
  • 108
  • 169

2 Answers2

8

bias=False

print(
    stats.kurtosis(x, bias=False), pd.DataFrame(x).kurtosis()[0],
    stats.skew(x, bias=False), pd.DataFrame(x).skew()[0],
    sep='\n'
)

-0.31467107631025515
-0.31467107631025604
-0.4447887763159889
-0.444788776315989
piRSquared
  • 285,575
  • 57
  • 475
  • 624
4

Pandas calculate UNBIASED estimator of the population kurtosis. Look at the Wikipedia for formulas: https://www.wikiwand.com/en/Kurtosis

enter image description here

Calculate kurtosis from scratch

import numpy as np
import pandas as pd
import scipy

x = np.array([0, 3, 4, 1, 2, 3, 0, 2, 1, 3, 2, 0,
              2, 2, 3, 2, 5, 2, 3, 999])
xbar = np.mean(x)
n = x.size
k2 = x.var(ddof=1) # default numpy is biased, ddof = 0
sum_term = ((x-xbar)**4).sum()
factor = (n+1) * n / (n-1) / (n-2) / (n-3)
second = - 3 * (n-1) * (n-1) / (n-2) / (n-3)

first = factor * sum_term / k2 / k2

G2 = first + second
G2 # 19.998428728659768

Calculate kurtosis using numpy/scipy

scipy.stats.kurtosis(x,bias=False) # 19.998428728659757

Calculate kurtosis using pandas

pd.DataFrame(x).kurtosis() # 19.998429

Similarly, you can also calculate skewness.

BhishanPoudel
  • 15,974
  • 21
  • 108
  • 169