Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?

Asked Jan 14 '22 at 08:26

Active Jan 14 '22 at 09:19


Background

Suppose there is a list of values that represents a person's activity over several hours. The person did not move at all during those hours, so all the values are 0.

What raised the question?

Searching on Google, I found a formula for skewness; the same formula appears on several other sites. Its denominator includes the standard deviation (SD). For a list of identical non-zero values (e.g., [1, 1, 1]), and likewise for all-zero values (i.e., [0, 0, 0]), the SD is 0. I therefore expected to get NaN (a division by 0) for the skewness. Surprisingly, I get 0 when calling pandas.DataFrame.skew().
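To make the expectation concrete, here is a minimal sketch of one common textbook form of skewness (the third central moment divided by the cubed population SD). The exact formula I found is not reproduced here, so this particular form and the name `naive_skew` are assumptions used only for illustration.

```python
import numpy as np

def naive_skew(values):
    """Hypothetical textbook skewness: mean cubed deviation / SD**3."""
    x = np.asarray(values, dtype=np.float64)
    deviations = x - x.mean()
    sd = np.sqrt((deviations ** 2).mean())  # population standard deviation
    # Silence the 0/0 warning; the division itself still yields NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return (deviations ** 3).mean() / sd ** 3

print(naive_skew([0, 0, 0]))  # SD is 0, so this is 0/0
print(naive_skew([1, 1, 1]))  # same situation for any constant list
```

With this form, a constant list leads to 0 divided by 0, which is why NaN seemed like the natural result.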

My Question

Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?

Minimum Reproducible Example

import pandas as pd
ot_df = pd.DataFrame(data={'Day 1': [0, 0, 0, 0, 0, 0],
                           'Day 2': [0, 0, 0, 0, 0, 0],
                           'Day 3': [0, 0, 0, 0, 0, 0]})
print(ot_df.skew(axis=1))

Note: I have checked several Q&As on this site (e.g., "How does pandas calculate skew?") and elsewhere (e.g., a discussion on GitHub), but I did not find an answer to my question.

asked Jan 14 '22 at 08:26

Md. Sabbir Ahmed


1 Answer


You can find the implementation here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py

The relevant part of the skewness function is:

    with np.errstate(invalid="ignore", divide="ignore"):
        result = (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)

    dtype = values.dtype
    if is_float_dtype(dtype):
        result = result.astype(dtype)

    if isinstance(result, np.ndarray):
        result = np.where(m2 == 0, 0, result)
        result[count < 3] = np.nan
    else:
        result = 0 if m2 == 0 else result
        if count < 3:
            return np.nan

As you can see, if m2 is 0 (which is the case whenever all the values are identical), the result is explicitly set to 0 instead of the NaN that the division would otherwise produce.
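The scalar branch of that logic can be sketched as a standalone function. This is not the pandas implementation itself: the name `skew_like_pandas` is hypothetical, and I am assuming that `m2` and `m3` are the sums of squared and cubed deviations from the mean, since the excerpt above does not show how they are defined. It also leaves out any floating-point cleanup the real code may perform.

```python
import numpy as np

def skew_like_pandas(values):
    """Hypothetical re-creation of the scalar branch shown above."""
    values = np.asarray(values, dtype=np.float64)
    count = values.size
    if count < 3:
        return np.nan  # skewness is undefined for fewer than 3 values

    adjusted = values - values.mean()
    m2 = (adjusted ** 2).sum()  # assumed: sum of squared deviations
    m3 = (adjusted ** 3).sum()  # assumed: sum of cubed deviations

    # 0/0 would give NaN here; the warning is suppressed as in the excerpt.
    with np.errstate(invalid="ignore", divide="ignore"):
        result = (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)

    # The special case that answers the question: constant input gives 0.
    return 0.0 if m2 == 0 else float(result)

print(skew_like_pandas([0, 0, 0, 0, 0, 0]))
print(skew_like_pandas([1, 2, 10]))
```

Removing the final `m2 == 0` check would bring back the NaN for constant input.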

If you are asking why it is implemented this way, I can only speculate. I suppose it was done for practical reasons: when you calculate skewness, you want to check whether the distribution of a variable is symmetrical, and you can argue that a constant variable indeed is (see https://stats.stackexchange.com/questions/114823/skewness-of-a-random-variable-that-have-zero-variance-and-zero-third-central-mom).

EDIT: It was done due to: https://github.com/pandas-dev/pandas/issues/11974 https://github.com/pandas-dev/pandas/pull/12121

You could open an issue requesting a flag that controls how this method behaves when the variable is constant. It should be easy to implement.
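Until such a flag exists, you can get NaN for constant rows on your side. The sketch below uses a hypothetical helper, `skew_nan_if_constant`, which is not part of pandas; it treats a row as constant when it has exactly one distinct non-missing value.

```python
import numpy as np
import pandas as pd

def skew_nan_if_constant(df, axis=1):
    """Hypothetical helper: like df.skew(), but NaN where input is constant."""
    skew = df.skew(axis=axis)
    # nunique() ignores NaN by default, so this checks the observed values.
    constant = df.nunique(axis=axis) == 1
    return skew.mask(constant)  # mask() replaces flagged entries with NaN

ot_df = pd.DataFrame(data={'Day 1': [0, 0, 5],
                           'Day 2': [0, 0, 1],
                           'Day 3': [0, 0, 1]})
result = skew_nan_if_constant(ot_df)
print(result)
```

The first two rows are constant and come back as NaN, while the third row keeps its ordinary skewness.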

edited Jan 14 '22 at 09:19

answered Jan 14 '22 at 09:10

Daniel Wlazło


Thank you so much, Daniel Wlazło, for your answer. I went through the GitHub links you provided, [1](https://github.com/pandas-dev/pandas/issues/11974) and [2](https://github.com/pandas-dev/pandas/pull/12121), but I could not find the reasons there. Did I miss anything? – Md. Sabbir Ahmed Jan 14 '22 at 11:18

It was done to standardize the output of the method for float input. Before this update, a "constant" float series could produce any of three results, seemingly at random: a very large or very small number, zero, or NaN. That is why they changed the way the skewness is calculated and made the result 0 when m2 == 0. – Daniel Wlazło Jan 14 '22 at 11:36
Thank you again. Could you please share your thoughts on another question of mine: https://stats.stackexchange.com/questions/560373/how-can-i-quantify-the-skewness-kurtosis-entropy-when-all-of-values-of-a-list? – Md. Sabbir Ahmed Jan 14 '22 at 11:45

If you want to use those values for machine learning, the most important thing is to have a consistent approach across rows. The model won't magically know that a value is a skewness or a kurtosis. If this is a very common situation and you think it might be a meaningful one, I would just add a new column to annotate it (e.g., a 0/1 flag indicating whether the values were constant) so the model can learn it more easily. – Daniel Wlazło Jan 14 '22 at 13:30
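A minimal sketch of that flag idea, assuming each row holds one person's readings; the column names `skew` and `is_constant` and the sample values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Day 1": [0, 2, 4],
                   "Day 2": [0, 2, 9],
                   "Day 3": [0, 2, 5]})

features = pd.DataFrame({
    # Skewness per row; pandas gives 0 for the constant rows.
    "skew": df.skew(axis=1),
    # 1 if every value in the row is identical, otherwise 0.
    "is_constant": df.nunique(axis=1).eq(1).astype(int),
})
print(features)
```

This keeps the skewness column consistent across rows while still letting a model tell a constant row apart from a genuinely symmetric one.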