
I'm using Azure Databricks + Hyperopt + MLflow for some hyperparameter tuning on a small dataset. It seems like the job is running, and I get output in MLflow, but the job ends with the following error message:

Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks

Here is my code, with some information redacted:

from pyspark.sql import SparkSession

# spark session initialization
spark = (SparkSession.builder.getOrCreate())
sc = spark.sparkContext

# Data Processing
import pandas as pd
import numpy as np
# Hyperparameter Tuning
from hyperopt import fmin, tpe, hp, anneal, Trials, space_eval, SparkTrials, STATUS_OK
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
# Modeling
from sklearn.ensemble import RandomForestClassifier
# cleaning
import gc
# tracking
import mlflow
# track runtime
from datetime import date, datetime

mlflow.set_experiment('/user/myname/myexp')
# notebook settings \ variable settings
n_splits = #
n_repeats = #
max_evals = #

dfL = pd.read_csv("/my/data/loc/mydata.csv")

x_train = dfL[['f1','f2','f3']]
y_train = dfL['target']

def define_model(params):
    model = RandomForestClassifier(n_estimators=int(params['n_estimators']),
                                   criterion=params['criterion'], 
                                   max_depth=int(params['max_depth']), 
                                   min_samples_split=params['min_samples_split'], 
                                   min_samples_leaf=params['min_samples_leaf'], 
                                   min_weight_fraction_leaf=params['min_weight_fraction_leaf'], 
                                   max_features=params['max_features'], 
                                   max_leaf_nodes=None, 
                                   min_impurity_decrease=params['min_impurity_decrease'], 
                                   min_impurity_split=None, 
                                   bootstrap=params['bootstrap'], 
                                   oob_score=False, 
                                   n_jobs=-1, 
                                   random_state=int(params['random_state']), 
                                   verbose=0, 
                                   warm_start=False, 
                                   class_weight={0:params['class_0_weight'], 1:params['class_1_weight']})
    return model


space = {'n_estimators': hp.quniform('n_estimators', #, #, #),
         'criterion': hp.choice('#', ['#','#']),
         'max_depth': hp.quniform('max_depth', #, #, #),
         'min_samples_split': hp.quniform('min_samples_split', #, #, #),
         'min_samples_leaf': hp.quniform('min_samples_leaf', #, #, #),
         'min_weight_fraction_leaf': hp.quniform('min_weight_fraction_leaf', #, #, #),
         'max_features': hp.quniform('max_features', #, #, #),
         'min_impurity_decrease': hp.quniform('min_impurity_decrease', #, #, #),
         'bootstrap': hp.choice('bootstrap', [#,#]),
         'random_state': hp.quniform('random_state', #, #, #),
         'class_0_weight': hp.choice('class_0_weight', [#,#,#]),
         'class_1_weight': hp.choice('class_1_weight', [#,#,#])}

# define hyperopt objective
def objective(params, n_splits=n_splits, n_repeats=n_repeats):

    # define model
    model = define_model(params)
    # get cv splits
    kfold = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1331)
    # define and run sklearn cv scorer
    scores = cross_val_score(model, x_train, y_train, cv=kfold, scoring='roc_auc')
    score = scores.mean()

    return {'loss': score*(-1), 'status': STATUS_OK}

spark_trials = SparkTrials(parallelism=36, spark_session=spark)
with mlflow.start_run():
  best = fmin(objective, space, algo=tpe.suggest, trials=spark_trials, max_evals=max_evals)

And then at the end I get:

100%|██████████| 200/200 [1:35:28<00:00, 100.49s/trial, best loss: -0.9584565527065526]

Hyperopt failed to execute mlflow.end_run() with tracking URI: databricks

Exception: 'MLFLOW_RUN_ID'

Total Trials: 200: 200 succeeded, 0 failed, 0 cancelled.

My Azure Databricks cluster is:

6.6 ML (includes Apache Spark 2.4.5, Scala 2.11)
Standard_DS3_v2
min 9 max 18 nodes

Am I doing something wrong or is this a bug?

yeamusic21

1 Answer


This message is a known but harmless issue, and it has been fixed in MLR 7.0 (Databricks Runtime 7.0 ML). I tried running it on a DBR 7.0 ML cluster and it works.

You don’t need start_run(); with SparkTrials, a run is started for you automatically. The explicit start_run() call is the only cause of the error. Without it, SparkTrials should still run the trials and log them for you.
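One reading of the `Exception: 'MLFLOW_RUN_ID'` line in the output, offered as an assumption rather than something confirmed from the Hyperopt source: a bare quoted name is how Python renders a `KeyError`, so the cleanup step may be looking up an `MLFLOW_RUN_ID` environment variable that is not set. The sketch below only reproduces the shape of that message; the helper `missing_env_message` is hypothetical and is not part of Hyperopt or MLflow.

```python
import os


def missing_env_message(name):
    """Return the text produced when `name` is looked up in os.environ but is not set.

    Hypothetical helper for illustration only; returns None if the variable exists.
    """
    try:
        os.environ[name]  # raises KeyError when the variable is unset
    except KeyError as exc:
        # str() of a KeyError is the repr of the missing key, quotes included
        return f"Exception: {exc}"
    return None


# Make sure the variable is unset for the demonstration
os.environ.pop("MLFLOW_RUN_ID", None)
print(missing_env_message("MLFLOW_RUN_ID"))  # prints: Exception: 'MLFLOW_RUN_ID'
```

If that reading is right, it fits the description above: the trials themselves finish and are logged, and only the final cleanup step complains.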

Jeremy Caney
  • Thanks so much for your response! Unfortunately, if I remove `start_run()` then nothing is saved to MLflow, but I guess this is because I don't have MLR 7.0? What is MLR 7.0? Also, this doesn't appear to be in line with the Databricks documentation: https://docs.databricks.com/applications/machine-learning/automl/hyperopt/hyperopt-spark-mlflow-integration.html – yeamusic21 Jun 12 '20 at 16:01
  • MLR 7.0 is Databricks Runtime 7.0 ML; please select this runtime when creating the cluster. – Shyamprasad Reddy Jul 07 '20 at 19:08