
I'm building multiple Prophet models where each model is passed to a pandas_udf function which trains the model and stores the results with MLflow.

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def forecast(data):
    ...
    with mlflow.start_run() as run:
        ...

Then I call this UDF which trains a model for each KPI.

df.groupBy('KPI').apply(forecast)

The idea is that, for each KPI, a model will be trained with multiple hyperparameters, and the best params for each model will be stored in MLflow. I would like to use Hyperopt to make the search more efficient.

In this case, where should I place the objective function? Since the data is passed to the UDF for each model, I thought of creating an inner function within the UDF that uses the data for each run. Does this make sense?

nescobar

1 Answer


If I remember correctly, you can't do it that way, because it would amount to something like nested Spark execution, and that won't work with Spark. You'll need to change your approach to something like:

for kpi in list_of_kpis:
    run_hyperopt_tuning(kpi)

if you need to tune parameters for each KPI model separately, because this will optimize the parameters for each KPI independently.

If the KPI is more like a hyperparameter of the model, then you can just include the list of KPIs in the search space, and load the necessary data inside the function that does the training & evaluation.

Alex Ott