
I am using pandas.rolling_apply to fit data to a distribution and get a value from it, but I also need it to report a rolling goodness of fit (specifically, a p-value). Currently I'm doing it like this:

def func(sample):
    fit = genextreme.fit(sample)
    return genextreme.isf(0.9, *fit)

def p_value(sample):
    fit = genextreme.fit(sample)
    return kstest(sample, 'genextreme', fit)[1]

values = pd.rolling_apply(data, 30, func)
p_values = pd.rolling_apply(data, 30, p_value)
results = pd.DataFrame({'values': values, 'p_value': p_values})

The problem is that I have a lot of data, and the fit function is expensive, so I don't want to call it twice for every sample. What I'd rather do is something like this:

def func(sample):
    fit = genextreme.fit(sample)
    value = genextreme.isf(0.9, *fit)
    p_value = kstest(sample, 'genextreme', fit)[1]
    return {'value': value, 'p_value': p_value}

results = pd.rolling_apply(data, 30, func)

Where results is a DataFrame with two columns. If I try to run this, I get an exception: TypeError: a float is required. Is it possible to achieve this, and if so, how?

aquavitae
  • Does it work if you return a Series rather than a dict? – Andy Hayden Mar 06 '14 at 08:11
  • @AndyHayden No, That gives `TypeError: cannot convert the series to ` – aquavitae Mar 06 '14 at 08:12
  • see this question http://stackoverflow.com/questions/19121854/using-rolling-apply-on-a-dataframe-object – Jeff Mar 06 '14 at 11:08
  • @Jeff That's a different question. That one is about taking in two inputs; this question is about producing two outputs. – RAY May 12 '16 at 01:55
  • Has anyone given you a good answer yet? I can write my own more generic roller but would prefer if there's a standard solution to this. – RAY May 12 '16 at 01:56
  • this is not real well supported; you can just do your own loop – Jeff May 12 '16 at 02:05
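
The "do your own loop" suggestion in the last comment could look roughly like the sketch below. It is only an illustration: `summarize` and `rolling_multi` are hypothetical names, and `summarize` is a cheap stand-in for the expensive fit (in the question it would return the `isf` value and the KS p-value instead). The point is that the function runs once per full window and each call contributes one row with several columns.

```python
import numpy as np
import pandas as pd

def summarize(window):
    """Hypothetical stand-in for the expensive fit: one call, two outputs."""
    values = window.to_numpy()
    return {"value": values.mean(), "spread": values.max() - values.min()}

def rolling_multi(series, size, func):
    """Call func once per full window and collect the dicts into a DataFrame."""
    rows = [func(series.iloc[end - size:end])
            for end in range(size, len(series) + 1)]
    # Label each row with the last index of its window (trailing window).
    out = pd.DataFrame(rows, index=series.index[size - 1:])
    # Reindex so the leading, incomplete windows appear as NaN rows.
    return out.reindex(series.index)

data = pd.Series(np.arange(10, dtype=float))
results = rolling_multi(data, 3, summarize)
print(results)
```

The loop runs in Python rather than in pandas' compiled code, but when the per-window function is itself expensive, as a distribution fit is, that overhead is probably small by comparison.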

4 Answers


I had a similar problem and solved it by passing a member function of a separate helper class to apply. The member function returns a single value, as apply requires, but it stores the other results of the calculation as attributes of the class, so I can use them afterwards.

Simple Example:

class CountCalls:
    def __init__(self):
        self.counter = 0

    def your_function(self, window):
        # f is your own function; it must yield a single number per window
        retval = f(window)
        # Side effect: keep extra state on the instance
        self.counter = self.counter + 1
        return retval


TestCounter = CountCalls()

your_seriesOrDataframeColumn.rolling(window=your_window_size).apply(TestCounter.your_function)

print(TestCounter.counter)

Assume your function f returns a tuple of two values, v1 and v2. You can then return v1 and assign the result to a column column_v1 of your dataframe. The second value, v2, you simply accumulate in a Series series_val2 within the helper class. Afterwards, you just assign that series as a new column of your dataframe.
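
A minimal sketch of that idea follows. The class name `SecondOutputCollector`, its method `first_output`, and the min/max stand-in for f are all assumptions made for illustration, and a plain list is used instead of a Series for the accumulation. It also assumes the data contain no NaNs, so that apply calls the function exactly once per full window.

```python
import numpy as np
import pandas as pd

class SecondOutputCollector:
    """Hypothetical helper: hands v1 back to apply() and keeps v2 for later."""

    def __init__(self):
        self.second = []

    def first_output(self, window):
        # Stand-in for a function f that returns a tuple (v1, v2).
        v1, v2 = np.min(window), np.max(window)
        self.second.append(v2)
        return v1

series = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
size = 3
collector = SecondOutputCollector()

df = pd.DataFrame(
    {"column_v1": series.rolling(size).apply(collector.first_output, raw=True)}
)
# The function only ran for full windows, so the stored values belong to the
# rows from position size - 1 onwards; index alignment leaves NaN before that.
df["column_v2"] = pd.Series(collector.second, index=series.index[size - 1:])
print(df)
```

If the data do contain NaNs, pandas skips windows with too few valid observations, so the stored values would need to be matched to rows some other way.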

Antony
JML64

I had a similar problem before. Here's my solution for it:

from collections import deque
class your_multi_output_function_class:
    def __init__(self):
        self.deque_2 = deque()
        self.deque_3 = deque()

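    def f1(self, window):
        # Run the multi-output function once per window and queue the extra outputs
        self.k = somefunction(window)
        self.deque_2.append(self.k[1])
        self.deque_3.append(self.k[2])
        return self.k[0]

    def f2(self, window):
        # Ignore the window; return the second output queued by f1
        return self.deque_2.popleft()

    def f3(self, window):
        # Ignore the window; return the third output queued by f1
        return self.deque_3.popleft()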

func = your_multi_output_function_class()

output = your_pandas_object.rolling(window=10).agg(
    {'a':func.f1,'b':func.f2,'c':func.f3}
    )
Yi Yu

I used and loved @yi-yu's answer so I made it generic:

from collections import deque
from functools import partial

def make_class(func, dim_output):

    class your_multi_output_function_class:
        def __init__(self, func, dim_output):
            assert dim_output >= 2
            self.func = func
            self.deques = {i: deque() for i in range(1, dim_output)}

        def f0(self, *args, **kwargs):
            k = self.func(*args, **kwargs)
            for queue in sorted(self.deques):
                self.deques[queue].append(k[queue])
            return k[0]

    def accessor(self, index, *args, **kwargs):
        return self.deques[index].popleft()

    klass = your_multi_output_function_class(func, dim_output)

    for i in range(1, dim_output):
        f = partial(accessor, klass, i)
        setattr(klass, 'f' + str(i), f)

    return klass

Given a function f of a pandas Series (windowed, but not necessarily) that returns n values, you use it this way:

rolling_func = make_class(f, n)
# dict to map the function's outputs to new columns. Eg:
agger = {'output_' + str(i): getattr(rolling_func, 'f' + str(i)) for i in range(n)} 
windowed_series.agg(agger)
Alex
  • I could not get this to work in my situation. I was getting `IndexError: pop from an empty deque`. You also forgot to import `partial` from `functools`. – wordsforthewise Oct 19 '18 at 03:35

I also had the same issue. I solved it by creating a global data frame and filling it from within the rolling function. In the following example script, I generate random input data and then calculate the min, the max and the mean with a single rolling apply call.

import pandas as pd
import numpy as np

global outputDF
global index

def myFunction(array):

    global index
    global outputDF

    # Some random operation
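    # Write with .loc; chained assignment (outputDF['min'][index] = ...) may not update outputDF
    outputDF.loc[index, 'min'] = np.nanmin(array)
    outputDF.loc[index, 'max'] = np.nanmax(array)
    outputDF.loc[index, 'mean'] = np.nanmean(array)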

    index += 1
    # Returning a useless variable
    return 0

if __name__ == "__main__":

    global outputDF
    global index

    # A random window size
    windowSize = 10

    # Preparing some random input data
    inputDF = pd.DataFrame({ 'randomValue': np.random.rand(500) })


    # Pre-Allocate memory
    outputDF = pd.DataFrame({ 'min': [np.nan] * len(inputDF),
                              'max': [np.nan] * len(inputDF),
                              'mean': [np.nan] * len(inputDF)
                              })   

    # Set the starting index: with center=True, the first full window is labelled at windowSize // 2
    index = windowSize // 2

    # Do the rolling apply here
    inputDF['randomValue'].rolling(window=windowSize, center=True).apply(myFunction, args=())

    # The last (windowSize - 1) // 2 rows have no full window either
    assert index + (windowSize - 1) // 2 == len(inputDF), 'Length mismatch'

    outputDF.index = inputDF.index

    # Optional : Clean the nulls
    outputDF.dropna(inplace=True)

    print(outputDF)