Spark ML models not able to deploy on Databricks inference

Question

I'm trying to deploy the spark models(sparkxgbregressor, rfregressor) in databricks. Is model inferencing available for ONLY scikit learn models? If yes, is there any other way to deploy spark models in databricks?

As per the ask, adding code for reference:(This code runs fine and logs the last run model instead of best, but has the following warning:

WARNING mlflow.pyspark.ml: Model PipelineModel_f*******6 will not be autologged because it is not allowlisted or or because one or more of its nested models are not allowlisted. Call mlflow.spark.log_model() to explicitly log the model, or specify a custom allowlist via the spark.mlflow.pysparkml.autolog.logModelAllowlistFile Spark conf (see mlflow.pyspark.ml.autolog docs for more info).

#-------------------------------------------------------XGBOost-------------------------------------------------------------------------
#train_df=train_df.limit(188123)
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from xgboost.spark import SparkXGBRegressor
from pyspark.ml.evaluation import RegressionEvaluator
import numpy as np
from mlflow.models.signature import infer_signature
from hyperopt import hp
#vec_assembler = VectorAssembler(inputCols=train_df.columns[1:], outputCol="features")

xgb = SparkXGBRegressor(num_workers=1, label_col="price", missing=0.0)

pipeline = Pipeline(stages=[ordinal_encoder, vec_assembler, xgb])
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price")

def objective_function(params):    
    # set the hyperparameters that we want to tune
    max_depth = params["max_depth"]
    n_estimators = params["n_estimators"]

    with mlflow.start_run():
        estimator = pipeline.copy({xgb.max_depth: max_depth, xgb.n_estimators: n_estimators})
        model = estimator.fit(train_df)

        preds = model.transform(test_df)
        rmse = regression_evaluator.evaluate(preds)
        mlflow.log_metric("rmse", rmse)

    return rmse


search_space = {
    "max_depth" : hp.choice('max_depth', np.arange(5, 15, dtype=int)),
     "n_estimators": hp.choice('n_estimators', np.arange(50, 80, dtype=int))
}

from hyperopt import fmin, tpe, Trials
import numpy as np
import mlflow
import mlflow.spark
import mlflow.sklearn
mlflow.pyspark.ml.autolog(log_models=True)


num_evals = 1
trials = Trials()
best_hyperparam = fmin(fn=objective_function, 
                       space=search_space,
                       algo=tpe.suggest, 
                       max_evals=num_evals,
                       trials=trials,
                       rstate=np.random.default_rng(42))

# Retrain model on train & validation dataset and evaluate on test dataset
with mlflow.start_run():
    best_max_depth = best_hyperparam["max_depth"]
    best_n_estimators = best_hyperparam["n_estimators"]
    estimator = pipeline.copy({xgb.max_depth: best_max_depth, xgb.n_estimators: best_n_estimators})
    combined_df = train_df.union(test_df) # Combine train & validation together

    pipeline_model = estimator.fit(combined_df)
    pred_df = pipeline_model.transform(test_df)
    #signature = infer_signature(train_df, pred_df)
    rmse = regression_evaluator.evaluate(pred_df)

    # Log param and metrics for the final model
    mlflow.log_param("maxdepth", best_max_depth)
    mlflow.log_param("n_estimators", best_n_estimators)
    mlflow.log_metric("rmse", rmse)
    mlflow.spark.log_model(pipeline_model, "model",input_example=test_df.select(old_cols_list).limit(1).toPandas())

What approach you tried ? Provide your code for these models. — JayashankarGS, Jul 13 '23 at 05:38
Do you get an error? SparkML models should work mostly, but there is a big overhead — Alex Ott, Jul 13 '23 at 06:39

score 0 · Answer 1 · answered Jul 13 '23 at 13:33

You use below approach for deploying your model.

First, log the model within mlflow run context as below.

with mlflow.start_run():
    mlflow.spark.log_model(xgb_reg_model, "xgb-model")

This will create the runs and logs the model in the xgb-model. You will get runs in experiment as shown in below.

enter image description here

After, you can predict it using the Run Id you got above.

import mlflow
logged_model = 'runs:/c11e55e6a57c492a8046001114f729e8/xgb-model'
loaded_model = mlflow.spark.load_model(logged_model)
display(loaded_model.transform(df_test))

In the above code, you load the model using Run Id, and df_test is the dataframe consists a column features of vector type.

enter image description here

Next, register the model as below.

logged_model = 'runs:/c11e55e6a57c492a8046001114f729e8/xgb-model'
mlflow.register_model(logged_model,'xgb-model-1')

enter image description here

And In Models it is registered.

enter image description here

Next, go to serving endpoint and create new one with the above registered model.

enter image description here

Here, give serving endpoint name and registered model. After creating the endpoint, query it with sample data using the url given.

Have added the coded, it seems to be the same as you have mentioned, but still not working — Rayzee, Jul 18 '23 at 10:21
It logs the last run, which shouldnt be happening. instead it should log the best run — Rayzee, Jul 18 '23 at 10:50

Spark ML models not able to deploy on Databricks inference

1 Answers1