Say I already have a PDF (probability density function) stored in a pandas DataFrame.

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame([1,2,3,4,5,6,5,4,3,2], index=np.linspace(21,30,10), columns=['days'])
df.index.names=['temperature']
print(df)
             days
temperature      
21.0            1
22.0            2
23.0            3
24.0            4
25.0            5
26.0            6
27.0            5
28.0            4
29.0            3
30.0            2

If I want to calculate metrics like skewness, I have to convert the PDF back to raw data like this:

temp_history = []
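for temperature, row in df.iterrows():
    # Repeat each temperature once per recorded day (plain Python types).
    temp_history += int(row['days']) * [float(temperature)]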

print(temp_history)
[21.0, 22.0, 22.0, 23.0, 23.0, 23.0, 24.0, 24.0, 24.0, 24.0, 25.0, 25.0, 25.0, 25.0, 25.0, 26.0, 26.0, 26.0, 26.0, 26.0, 26.0, 27.0, 27.0, 27.0, 27.0, 27.0, 28.0, 28.0, 28.0, 28.0, 29.0, 29.0, 29.0, 30.0, 30.0]

skew = stats.skew(temp_history)

Is there any way I can calculate the metrics without having to create temp_history? Thanks!

Edit: The reason I want to avoid creating raw data in any form is that I don't want to use up a huge chunk of memory once the numbers in the days column get bigger.

cf1

1 Answer

Use -

df.reindex(df.index.repeat(df['days'])).reset_index()['temperature'].skew()

OR

To stick to your original implementation -

stats.skew(df.reindex(df.index.repeat(df['days'])).reset_index()['temperature'])

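If you are wondering why the two outputs won't match: pandas' skew() returns the bias-corrected estimate, while stats.skew() returns the biased one by default (bias=True). It's discussed here

To make both match, set bias=False in stats.skew()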
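
As a quick check of the bias point, here is a minimal sketch that rebuilds the example table from the question and compares the three calls. The variable names (expanded, pandas_skew, scipy_biased, scipy_unbiased) are only illustrative and not part of either library.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Frequency table from the question: number of days observed per temperature.
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 5, 4, 3, 2],
                  index=np.linspace(21, 30, 10), columns=['days'])
df.index.names = ['temperature']

# Expand the table: each temperature is repeated 'days' times.
expanded = df.reindex(df.index.repeat(df['days'])).reset_index()['temperature']

pandas_skew = expanded.skew()                      # bias-corrected estimate
scipy_biased = stats.skew(expanded)                # default: bias=True
scipy_unbiased = stats.skew(expanded, bias=False)  # bias-corrected, like pandas

print(pandas_skew, scipy_biased, scipy_unbiased)
```

On this data, pandas_skew and scipy_unbiased should agree to floating-point precision, while scipy_biased is slightly closer to zero.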

Vivek Kalyanarangan
  • Many thanks! So I guess there is really no way to calculate metrics directly on the pdf dataframe? My only concern is that the performance might suffer when the numbers in 'days' get very large. – cf1 Sep 07 '20 at 20:20
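
On the follow-up in the comment: skewness can be computed from the frequency table directly by treating the days column as weights in the moment formulas, so the raw data never has to be materialised. The sketch below is one way to do that; weighted_skew is a hypothetical helper written for this example, not a pandas or SciPy function, and it assumes the counts are non-negative integer frequencies with at least three observations in total and more than one distinct value.

```python
import numpy as np
import pandas as pd
from scipy import stats


def weighted_skew(values, counts, bias=True):
    """Skewness of data given as (value, frequency) pairs.

    Hypothetical helper: memory use is proportional to the number of
    distinct values, not to the total number of observations.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(counts, dtype=float)
    n = w.sum()                       # total number of observations
    mean = (w * x).sum() / n
    dev = x - mean
    m2 = (w * dev ** 2).sum() / n     # second central moment
    m3 = (w * dev ** 3).sum() / n     # third central moment
    g1 = m3 / m2 ** 1.5               # biased (population) skewness
    if bias:
        return g1
    # Sample-size correction, the same adjustment as bias=False in stats.skew.
    return g1 * np.sqrt(n * (n - 1)) / (n - 2)


df = pd.DataFrame([1, 2, 3, 4, 5, 6, 5, 4, 3, 2],
                  index=np.linspace(21, 30, 10), columns=['days'])
df.index.names = ['temperature']

skew_biased = weighted_skew(df.index, df['days'])
skew_unbiased = weighted_skew(df.index, df['days'], bias=False)

# Cross-check on this small table only; the expansion is not needed in real use.
raw = np.repeat(df.index.to_numpy(), df['days'].to_numpy())
print(skew_biased, stats.skew(raw))
print(skew_unbiased, stats.skew(raw, bias=False))
```

The same weighting idea extends to other moment-based metrics such as variance or kurtosis, since each only needs weighted sums over the distinct temperatures.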