Automating fill with calculation function in python

Question

What I got so far is the code below and it works fine and brings the results it should: It fills df['c'] with the calculation previous c * b if there is no c given. The problem is that I have to apply this to a bigger data set len(df.index) = ca. 10.000, so the function I have so far is inappropriate since I would have to write a couple of thousand times: df['c'] = df.apply(func, axis =1). A while loop is no option in pandas for this size of dataset. Any ideas?

import pandas as pd
import numpy as np
import datetime

randn = np.random.randn
rng = pd.date_range('1/1/2011', periods=10, freq='D')

df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},index=rng)
df["c"] =np.NaN

df["c"][0] = 1
df["c"][2] = 3


def func(x):
    if pd.notnull(x['c']):
        return x['c']
    else:
        return df.iloc[df.index.get_loc(x.name) - 1]['c'] * x['b']

df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)

The solution I am applying I foud here: http://stackoverflow.com/questions/30641509/forwardfill-combined-with-calculation-method-ffill-xyz-in-python-pandas — sorownas, Jun 06 '15 at 00:57
@boardrider What do you think is the problem with this question? — chrisaycock, Jun 06 '15 at 00:58
@hb.klein Your edit does not make this question the least bit more searchable. The tags "python" and "pandas" are already on this post, which is how I and everyone else will find it. Also, Stack Overflow already prepends the tag to every page title so Google will handle it better. — chrisaycock, Jun 06 '15 at 01:08
Difficult to get a clear idea what your issue really is. Try to re-formulate as: I want to do A, I tried B,C, and D, Here's a minimal code E that demonstrate my problem, and here're the exceptions I'm getting on input F or G. — boardrider, Jun 06 '15 at 01:16

score 4 · Accepted Answer · answered Jun 06 '15 at 01:31

Here is a nice way of solving a recurrence problem. There will be docs on this in v0.16.2 (releasing next week). See docs for numba

This will be quite performant as the real heavy lifting is done in fast jit-ted compiled code.

import pandas as pd
import numpy as np
from numba import jit

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': np.nan * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},index=rng)
df.ix[0,"c"] = 1
df.ix[2,"c"] = 3

@jit
def ffill(arr_b, arr_c):

    n = len(arr_b)
    assert len(arr_b) == len(arr_c)
    result = arr_c.copy()

    for i in range(1,n):
        if not np.isnan(arr_c[i]):
            result[i] = arr_c[i]
        else:
            result[i] = result[i-1]*arr_b[i]

    return result

df['d'] = ffill(df.b.values, df.c.values)

             a   b   c      d
2011-01-01 NaN   2   1      1
2011-01-02 NaN   3 NaN      3
2011-01-03 NaN  10   3      3
2011-01-04 NaN   3 NaN      9
2011-01-05 NaN   5 NaN     45
2011-01-06 NaN   8 NaN    360
2011-01-07 NaN   4 NaN   1440
2011-01-08 NaN   1 NaN   1440
2011-01-09 NaN   2 NaN   2880
2011-01-10 NaN   6 NaN  17280

Could you kindly look at this question, it's quite similar. We just can't find a solution: http://stackoverflow.com/questions/30745160/portfolio-rebalancing-with-bandwidth-method-in-python — sorownas, Jun 12 '15 at 08:45

unutbu · Answer 2 · 2015-06-06T02:04:42.887

If you print out the values of df in a for-loop:

for i in range(7):
    df['c'] = df.apply(func, axis =1)
    print(df)

you can trace the origin of the values in the c column:

               a   b      c
2011-01-01  None   2      1    1
2011-01-02  None   3      3    3*1
2011-01-03  None  10      3    1*3*1
2011-01-04  None   3      9    3*1*3*1
2011-01-05  None   5     45    5*3*1*3*1
2011-01-06  None   8    360    ...
2011-01-07  None   4   1440    ...
2011-01-08  None   1   1440    ...
2011-01-09  None   2   2880    ...
2011-01-10  None   6  17280    6*2*4*8*5*3*3

You can clearly see the values are coming from a cumulative product. Each row is the value from the previous row multiplied by some new number. That new number sometimes comes from b, or is sometimes a 1 (when c is not NaN).

So if we can create a column d which has these "new" numbers in it, then the desired values can be computed by the cumprod:

df['c'] = df['d'].cumprod()

import pandas as pd
import numpy as np
import datetime

randn = np.random.randn

def setup_df():
    rng = pd.date_range('1/1/2011', periods=10, freq='D')
    df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},
                      index=rng)
    df["c"] = np.NaN
    df.iloc[0, -1] = 1
    df.iloc[2, -1] = 3
    return df

df = setup_df()
df['d'] = df['b']
mask = pd.notnull(df['c'])
df.loc[mask, 'd'] = 1
df['c'] = df['d'].cumprod()
print(df)

yields

               a   b      c  d
2011-01-01  None   2      1  1
2011-01-02  None   3      3  3
2011-01-03  None  10      3  1
2011-01-04  None   3      9  3
2011-01-05  None   5     45  5
2011-01-06  None   8    360  8
2011-01-07  None   4   1440  4
2011-01-08  None   1   1440  1
2011-01-09  None   2   2880  2
2011-01-10  None   6  17280  6

I left the d column in to help show where the c values are coming from. You can of course delete the column with

del df['d']

Or better yet, as chrisaycock points out, you could forgo defining the d column altogether and instead use

df['c'] = np.where(pd.notnull(df['c']), 1, df['b']).cumprod()

This is very nice and should be accepted as the answer. If the OP wants a one-liner, it's simply `np.where(pd.notnull(df.c), 1, df.b).cumprod()`. — chrisaycock, Jun 06 '15 at 01:44
Could you kindly look at this question, it's quite similar. We just can't find a solution: http://stackoverflow.com/questions/30745160/portfolio-rebalancing-with-bandwidth-method-in-python — sorownas, Jun 12 '15 at 08:46

chrisaycock · Answer 3 · 2015-06-06T01:32:05.010

1

You can just write a write loop like this:

for i in range(1, len(df)):
    if pd.isnull(df.c[i]):
        df.c[i] = df.c[i-1] * df.b[i]

If this takes too long for you, you could jit it with numba. Your example DataFrame is too small for a meaningful test on my system though.

edited Jun 06 '15 at 01:32

answered Jun 06 '15 at 01:22

chrisaycock

36,470
14
88
125

hah, see my example...already done! note you cannot directly jit the way you have written it though. You cannot use things like ``pd.isnull``, nor the indexing operators; you have to use numpy arrays (and then wrap then at the end). – Jeff Jun 06 '15 at 01:32
@Jeff Ah, that makes sense. My test of jit actually came-out slower than vanilla Python, but I had thought that was just an issue of how small the DataFrame was. Thanks for the explanation. – chrisaycock Jun 06 '15 at 01:35
yes, it *would* work, but it will be in *python* mode (hence its not really translating to LLVM, etc.); you can force *nopython* mode, e.g. ``jit(nopython=True)`` to force it to raise if you do something that is not kosher :) – Jeff Jun 06 '15 at 01:38
@chrisaycock : could you kindly have a look at this. Its a similar problem: http://stackoverflow.com/questions/30745160/portfolio-rebalancing-with-bandwidth-method-in-python – sorownas Jun 12 '15 at 08:54

Automating fill with calculation function in python

3 Answers3