I have a pandas dataframe with rows of time-series data.
I want to define a function compute_features(*args, **kwargs)
that I can use to compute some features (20+ cols) on an existing dataframe of time-series data (5 cols) for a machine learning model.
When using the model in a real-time application, I'll be receiving new rows of data (5 cols) -- which I'll have to compute the features on -- to then add to the dataframe. The slight problem is that some of these features are rolling and require the past N values in some of the columns.
Thus, I want to use the compute_features function to hold all my feature-engineering logic, and when a flag is set (let's say update=True) I can pass a pre-existing feature dataframe and a new row of data, on which we will i) compute the features and ii) append the row.
I can think of 'fast and dirty' ways of accomplishing this, but I was wondering if there might be a more idiomatic, Pythonic, and less complex way of solving this problem.
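To make the intended usage concrete, the call pattern I have in mind is roughly this (the names are placeholders, and the update flag could equally be inferred from whether new_row is passed):

# Batch mode: featurize the full historical dataframe
features_df = compute_features(raw_df)

# Real-time mode: compute features for the one new raw row, then append it
features_df = compute_features(features_df, new_row=latest_row, update=True)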
EDIT:
For example:
import pandas as pd
from typing import Optional

path = ...
df = pd.read_csv(path)

def _compute_features_inner(df: pd.DataFrame, new_row: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    # Contains the logic for computing features. If there's a new row,
    # we want to use the existing df and only run the transforms on the
    # final row.
    df['feature_1'] = df['a'].rolling(window=10).mean()  # rolling needs an aggregation, e.g. mean
    df['feature_2'] = df['a'].rolling(window=20).mean()
    ...
    return df

def compute_features(df: pd.DataFrame, new_row: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    if new_row is None:
        df = _compute_features_inner(df)
    else:
        df = _compute_features_inner(df, new_row)
    return df
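One of the 'fast and dirty' approaches I can think of for the incremental path looks roughly like this (assuming, for illustration, that the largest rolling window is 20 and that new_row carries the same 5 raw columns):

MAX_WINDOW = 20  # assumed largest lookback any rolling feature needs

def _compute_features_inner(df: pd.DataFrame, new_row: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    if new_row is None:
        # Batch path: compute every feature over the full history.
        df['feature_1'] = df['a'].rolling(window=10).mean()
        df['feature_2'] = df['a'].rolling(window=20).mean()
        return df
    # Incremental path: slice off just enough raw history to cover the
    # largest window, recompute the rolling features over that slice,
    # and append only the fully-featurized final row.
    raw_cols = new_row.columns
    tail = pd.concat([df[raw_cols].tail(MAX_WINDOW - 1), new_row], ignore_index=True)
    tail['feature_1'] = tail['a'].rolling(window=10).mean()
    tail['feature_2'] = tail['a'].rolling(window=20).mean()
    return pd.concat([df, tail.tail(1)], ignore_index=True)

This works, but it duplicates the feature definitions between the two paths, which is exactly the kind of complexity I'd like to avoid.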
And then in the real-time environment, we might get a new row every 15 minutes (so we don't mind simply appending to the dataframe in this instance). I want to only run the compute_features transform for the last row, but you can't directly append the new_row as it doesn't have the requisite feature columns.
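For reference, the real-time loop I'm picturing is something like this (get_latest_row and model are hypothetical stand-ins for the data feed and the fitted model):

import time

while True:
    new_row = get_latest_row()                  # hypothetical: returns a 1-row DataFrame with the 5 raw columns
    df = compute_features(df, new_row=new_row)  # featurize and append just that row
    prediction = model.predict(df.tail(1))      # hypothetical: score the latest featurized row
    time.sleep(15 * 60)                         # new data arrives every 15 minutes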