
I'm trying to figure out an efficient split/apply/combine scheme for the following scenario. Consider the pandas dataframe demoAll defined below:

import datetime
import pandas as pd


demoA = pd.DataFrame({'date':[datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                     'ticker':['A', 'A', 'A'],
                     'x1':[10,20,30],
                     'close':[120, 133, 129]}).set_index('date', drop=True)
demoB = pd.DataFrame({'date':[datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                     'ticker':['B', 'B', 'B'],
                     'x1':[18,11,45],
                     'close':[50, 49, 51]}).set_index('date', drop=True)
demoAll = pd.concat([demoA, demoB])
print(demoAll)

The result is:

           ticker  x1  close
date                        
2010-01-01      A  10    120
2010-01-02      A  20    133
2010-01-03      A  30    129
2010-01-01      B  18     50
2010-01-02      B  11     49
2010-01-03      B  45     51

I also have a dictionary mapping of tickers to model objects

ticker2model = {'A':model_A, 'B':model_B,...}

where each model has a `predict(df)` method that takes in an entire dataframe and returns a series of the same length.

I now would like to create a new column, demoAll['predictions'], that corresponds to these predictions. What is the cleanest/most-efficient way of doing this? A few things to note:

  1. demoAll was the concatenation of ticker-specific dataframes that were each indexed just by date. Thus the indices of demoAll are not unique. (However, the combination of date/ticker IS unique.)

  2. My thinking has been to do something like the example below, but I am running into issues with indexing, data-type coercions, and slow run times. The real dataset is quite large (in both rows and columns).

    demoAll['predictions'] = demoAll.groupby('ticker').apply(
                               lambda x: ticker2model[x.name].predict(x)
                             )
    
  • Instead of running a single function through groupby-and-apply, you could refactor the predict step to use pandas' optimized DataFrame operations across the groups. – Meow Dec 31 '19 at 00:34
  • I am a bit confused by your `predict` function: you say that it returns a series, yet you assign the result of the groupby-and-apply operation as the new column. Can you provide some more details on the `predict` function, or a minimal example for your demo data? – Gerd Jan 04 '20 at 11:30
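As an illustration of the refactoring suggested in the first comment: if the per-ticker models expose their parameters, the per-group `predict` calls can be replaced by a single vectorized computation over the whole frame. The linear weights below are purely hypothetical, just to make the idea concrete:

```python
import datetime
import pandas as pd

demoAll = pd.DataFrame({
    'date': [datetime.date(2010, 1, d) for d in (1, 2, 3)] * 2,
    'ticker': ['A'] * 3 + ['B'] * 3,
    'x1': [10, 20, 30, 18, 11, 45],
    'close': [120, 133, 129, 50, 49, 51],
}).set_index('date')

# Hypothetical: suppose each ticker's model is linear with known weights.
coefs = pd.DataFrame({
    'ticker': ['A', 'B'],
    'w_x1': [0.1, 0.2],
    'w_close': [1.0, 0.5],
})

# One merge plus one arithmetic expression instead of a Python-level loop
# over groups; pandas evaluates this column-wise in C.
merged = demoAll.reset_index().merge(coefs, on='ticker')
merged['predictions'] = merged['x1'] * merged['w_x1'] + merged['close'] * merged['w_close']
```

This only works when the model family allows its parameters to be stacked into columns like this; black-box `predict` methods still need the per-group approach.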

1 Answer


I may have misunderstood what you are passing to the model to generate predictions, but if I have understood correctly, I would do the following:

  1. pre-allocate the predictions column of demoAll
  2. loop through the unique ticker values, filtering demoAll for each
  3. drop the ticker column from the filtered frame
  4. predict on the filtered dataframe
  5. write the results back to the matching rows of demoAll['predictions']

An Example using your code:

# get non 'ticker' columns
non_ticker_cols = [col for col in demoAll.columns if col != 'ticker']
# get unique set of tickers 
tickers = demoAll.ticker.unique()
# create and pre-populate the predictions column as float to avoid dtype coercion
demoAll['predictions'] = 0.0

for ticker in tickers:
    # get boolean Series to filter the Dataframes by.
    filter_by_ticker = demoAll.ticker == ticker
    # filter, predict and allocate 
    demoAll.loc[filter_by_ticker, 'predictions'] = ticker2model[
        ticker].predict(
        demoAll.loc[filter_by_ticker,
                    non_ticker_cols]
    )        

The output would look like:

           ticker  x1  close  predictions
date
2010-01-01      A  10    120         10.0
2010-01-02      A  20    133         10.0
2010-01-03      A  30    129         10.0
2010-01-01      B  18     50        100.0
2010-01-02      B  11     49        100.0
2010-01-03      B  45     51        100.0

Comparison to using apply

We could use apply row by row but, as you mentioned, it would be slow. I will compare the two approaches to give an idea of the speedup.

Setup

I will use sklearn's DummyRegressor so that I have models with a predict method, and build the dictionary you mention in your question.

import numpy as np
from sklearn.dummy import DummyRegressor

model_a = DummyRegressor(strategy='mean')
model_b = DummyRegressor(strategy='median')

model_a.fit([[10,14]], y=np.array([10]))
model_b.fit([[200,200]], [100])
ticker2model = {'A':model_a, 'B':model_b}
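For context on the numbers in the output table: `DummyRegressor` ignores the input features at predict time and always returns the constant it computed from `y` during `fit` (here the mean, 10, and the median, 100), which is why the predictions column contains only 10.0 and 100.0:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

model_a = DummyRegressor(strategy='mean')
model_b = DummyRegressor(strategy='median')
model_a.fit([[10, 14]], y=np.array([10]))
model_b.fit([[200, 200]], [100])

# The features passed to predict are ignored; only the row count matters.
print(model_a.predict([[1, 2], [3, 4]]))  # two rows in, two constants out
print(model_b.predict([[0, 0]]))
```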

Defining both as functions

def predict_by_ticker_filter(df, model_dict):
    # get non 'ticker' columns
    non_ticker_cols = [col for col in df.columns if col != 'ticker']
    # get unique set of tickers 
    tickers = df.ticker.unique()
    # create and pre-populate the predictions column as float to avoid dtype coercion
    df['predictions'] = 0.0

    for ticker in tickers:
        # get boolean Series to filter the Dataframes by.
        filter_by_ticker = df.ticker==ticker
        # filter, predict and allocate 
        df.loc[filter_by_ticker,'predictions'] = model_dict[ticker].predict(
            df.loc[filter_by_ticker,
                   non_ticker_cols]
        )
    return df

def model_apply_by_row(df_row, model_dict):
    # includes some conversions to list to allow the predict method to run
    return model_dict[df_row['ticker']].predict([df_row[['x1','close']].tolist()])[0]

Performance

I used `%timeit` on each function call, with the following results.

On your example demoAll:

model_apply_by_row

%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)

3.78 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

predict_by_ticker_filter

%timeit predict_by_ticker_filter(demoAll, ticker2model)

6.24 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Increasing the size of demoAll to (606, 3):

model_apply_by_row

%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)

320 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

predict_by_ticker_filter

%timeit predict_by_ticker_filter(demoAll, ticker2model)

6.1 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Increasing the size of demoAll to (6006, 3):

model_apply_by_row

%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)

3.15 s ± 371 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

predict_by_ticker_filter

%timeit predict_by_ticker_filter(demoAll, ticker2model)

9.1 ms ± 767 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)