
Here is a model I created:

class SomeModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, input):
        # do fancy ML stuff
        # log results
        pandas_df = pd.DataFrame(...insert predictions here...)
        spark_df = spark.createDataFrame(pandas_df)
        spark_df.write.saveAsTable('tablename', mode='append')

Later in my code, I try to log the model like this:

with mlflow.start_run(run_name="SomeModel_run"):
    model = SomeModel()
    mlflow.pyfunc.log_model("somemodel", python_model=model)

Unfortunately, it gives me this error message:

RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

The error is caused by the line mlflow.pyfunc.log_model("somemodel", python_model=model); if I comment it out, my model makes its predictions and logs the results to my table.

Alternatively, if I remove the lines in my predict function where I call Spark to create a DataFrame and save the table, I am able to log my model.

How do I go about resolving this issue? I need my model to not only write to the table but also be logged.

ghostiek
  • Why do you need to write from the model into a table? – Alex Ott Jul 07 '22 at 18:04
  • we want to log the results of our ML algo – ghostiek Jul 07 '22 at 19:54
  • How do you invoke your model? Rest api? – Alex Ott Jul 07 '22 at 20:50
  • Yeah, which is why we need to log the model to then register it – ghostiek Jul 08 '22 at 15:31
  • Have you figured this out yet @ghostiek? I'm facing a similar issue – tyleroki Jul 21 '22 at 05:40
  • Decided to just say screw registering models and instead create a job using the notebook. It's more pricey though, as you need a full compute cluster. Still open to finding a proper solution though. – ghostiek Jul 22 '22 at 06:24
  • That is a shame. I'll give you an update if I find the solution. – tyleroki Jul 22 '22 at 06:36
  • note that I also tried using DBFS and altering files, but since it's hosted somewhere else I wasn't able to access the path. I also looked into connecting to the Databricks API to send a request to read/write a file to log results, but that fell through as well: I needed to create a .netrc file to log in, and there was no way I was going to put my Databricks login/password on a registered model. [DBFS API Doc](https://docs.databricks.com/dev-tools/api/latest/dbfs.html) [Auth Doc](https://docs.databricks.com/dev-tools/api/latest/authentication.html) – ghostiek Jul 22 '22 at 18:40

1 Answer


I had a similar error and was able to resolve it by moving Spark calls like spark.createDataFrame(pandas_df) outside of the class. If you want to read or write data using Spark, do it in the main function instead:

    class SomeModel(mlflow.pyfunc.PythonModel):
        def predict(self, context, input):
            # do fancy ML stuff
            # log results
            return predictions

    with mlflow.start_run(run_name="SomeModel_run"):
        model = SomeModel()
        pandas_df = pd.DataFrame(...insert predictions here...)
        spark_df = spark.createDataFrame(pandas_df)
        spark_df.write.saveAsTable('tablename', mode='append')
        mlflow.pyfunc.log_model("somemodel", python_model=model)
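
The reason this pattern works can be illustrated without Spark or MLflow at all: log_model serializes the PythonModel instance (MLflow uses cloudpickle under the hood), and any unserializable handle the instance drags along, such as a SparkSession or SparkContext, makes the dump fail. Here is a rough stand-alone sketch of that idea, using a threading.Lock as a stand-in for the unpicklable Spark handle (the class names are illustrative, not MLflow APIs):

```python
import pickle
import threading

class ModelHoldingHandle:
    """Analogue of a model whose predict() depends on a Spark handle."""
    def __init__(self):
        # Stand-in for a SparkSession/SparkContext: not serializable
        self.spark = threading.Lock()

class PlainModel:
    """Analogue of the fixed model: pure Python only, Spark I/O in the caller."""
    def predict(self, context, model_input):
        return [x * 2 for x in model_input]

# Serializing the handle-holding model fails, just like log_model does
try:
    pickle.dumps(ModelHoldingHandle())
    holding_ok = True
except TypeError:
    holding_ok = False  # cannot pickle the embedded handle

# The Spark-free model serializes without issue
plain_bytes = pickle.dumps(PlainModel())
```

Keeping predict free of Spark references and doing the table write in the driver code is what makes the model serializable, and therefore loggable.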
S J
  • This doesn't work if we use model serving. It will work in a notebook, but not if we use the model as an endpoint – ghostiek Nov 16 '22 at 19:04