I have a pandas dataframe with rows of time-series data.
I want to define a function compute_features(*args, **kwargs)
that I can use to compute some features (20+ cols) on an existing dataframe of time-series data (5 cols) for a machine learning model.
When using the model in a real-time application, I'll be receiving new rows of data (5 cols) -- which I'll have to compute the features on -- to then add to the dataframe. The slight problem is that some of these features are rolling and require the past N values in some of the columns.
Thus, I want to use the compute_features function to hold all my feature-engineering logic, and when a flag is set (let's say update=True) I can pass a pre-existing feature dataframe and a new row of data, on which we will i) compute the features and ii) append the row.
I can think of 'fast and dirty' ways of accomplishing this, but I was wondering if there might be a more idiomatic, Pythonic, and less complex way of solving this problem.
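To make the intended usage concrete, the call pattern I have in mind is roughly this (the names are placeholders, and the update flag could equally be inferred from whether new_row is passed):

# Batch mode: featurize the full historical dataframe
features_df = compute_features(raw_df)

# Real-time mode: compute features for the one new raw row, then append it
features_df = compute_features(features_df, new_row=latest_row, update=True)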
EDIT:
For example:
import pandas as pd
from typing import Optional

path = ...
df = pd.read_csv(path)

def _compute_features_inner(df: pd.DataFrame, new_row: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    # Contains the logic for computing features. If there's a new row,
    # we want to use the existing df and only run the transforms on the
    # final row.
    df['feature_1'] = df['a'].rolling(window=10).mean()  # rolling needs an aggregation, e.g. mean
    df['feature_2'] = df['a'].rolling(window=20).mean()
    ...
    return df

def compute_features(df: pd.DataFrame, new_row: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    if new_row is None:
        df = _compute_features_inner(df)
    else:
        df = _compute_features_inner(df, new_row)
    return df
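One of the 'fast and dirty' approaches I can think of for the incremental path looks roughly like this (assuming, for illustration, that the largest rolling window is 20 and that new_row carries the same 5 raw columns):

MAX_WINDOW = 20  # assumed largest lookback any rolling feature needs

def _compute_features_inner(df: pd.DataFrame, new_row: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    if new_row is None:
        # Batch path: compute every feature over the full history.
        df['feature_1'] = df['a'].rolling(window=10).mean()
        df['feature_2'] = df['a'].rolling(window=20).mean()
        return df
    # Incremental path: slice off just enough raw history to cover the
    # largest window, recompute the rolling features over that slice,
    # and append only the fully-featurized final row.
    raw_cols = new_row.columns
    tail = pd.concat([df[raw_cols].tail(MAX_WINDOW - 1), new_row], ignore_index=True)
    tail['feature_1'] = tail['a'].rolling(window=10).mean()
    tail['feature_2'] = tail['a'].rolling(window=20).mean()
    return pd.concat([df, tail.tail(1)], ignore_index=True)

This works, but it duplicates the feature definitions between the two paths, which is exactly the kind of complexity I'd like to avoid.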
And then in the real-time environment, we might get a new row every 15 minutes (so we don't mind simply appending to the dataframe in this instance). I want to only run the compute_features transform for the last row, but you can't directly append the new_row as it doesn't have the requisite feature columns.
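For reference, the real-time loop I'm picturing is something like this (get_latest_row and model are hypothetical stand-ins for the data feed and the fitted model):

import time

while True:
    new_row = get_latest_row()                  # hypothetical: returns a 1-row DataFrame with the 5 raw columns
    df = compute_features(df, new_row=new_row)  # featurize and append just that row
    prediction = model.predict(df.tail(1))      # hypothetical: score the latest featurized row
    time.sleep(15 * 60)                         # new data arrives every 15 minutes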