
Here is a model I created:

class SomeModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, input):
        # do fancy ML stuff
        # log results
        pandas_df = pd.DataFrame(...insert predictions here...)
        spark_df = spark.createDataFrame(pandas_df)
        spark_df.write.saveAsTable('tablename', mode='append')

Later in my code, I try to log the model like this:

with mlflow.start_run(run_name="SomeModel_run"):
    model = SomeModel()
    mlflow.pyfunc.log_model("somemodel", python_model=model)

Unfortunately, it gives me this error message:

RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

The error is caused by the line mlflow.pyfunc.log_model("somemodel", python_model=model); if I comment it out, my model makes its predictions and logs the results to my table.

Alternatively, if I remove the lines in my predict function where I call Spark to create a DataFrame and save the table, I am able to log my model.

How do I go about resolving this issue? I need my model to not only write to the table but also be logged.

ghostiek
  • Why do you need to write from the model into a table? – Alex Ott Jul 07 '22 at 18:04
  • we want to log the results of our ML algo – ghostiek Jul 07 '22 at 19:54
  • How do you invoke your model? Rest api? – Alex Ott Jul 07 '22 at 20:50
  • Yeah, which is why we need to log the model to then register it – ghostiek Jul 08 '22 at 15:31
  • Have you figured this out yet @ghostiek? I'm facing a similar issue – tyleroki Jul 21 '22 at 05:40
  • Decided to just say screw registering models and instead create a job using the notebook. It's more pricey though, as you need a full compute cluster. Still open to finding a proper solution though. – ghostiek Jul 22 '22 at 06:24
  • That is a shame. I'll give you an update if I find the solution. – tyleroki Jul 22 '22 at 06:36
  • note that I also tried using DBFS and altering files, but since it's hosted somewhere else I wasn't able to access the path. I also looked into connecting to the Databricks API to send a request to read/write a file to log results, but that fell through as well: I needed to create a .netrc file to log in, and there was no way I was going to put my Databricks login/password on a registered model. [DBFS API Doc](https://docs.databricks.com/dev-tools/api/latest/dbfs.html) [Auth Doc](https://docs.databricks.com/dev-tools/api/latest/authentication.html) – ghostiek Jul 22 '22 at 18:40

1 Answer


I had a similar error and was able to resolve it by moving Spark calls like spark.createDataFrame(pandas_df) outside of the class. If you want to read or write data using Spark, do it in the main function instead:

    class SomeModel(mlflow.pyfunc.PythonModel):
        def predict(self, context, input):
            # do fancy ML stuff
            # log results
            return predictions

    with mlflow.start_run(run_name="SomeModel_run"):
        model = SomeModel()
        pandas_df = pd.DataFrame(...insert predictions here...)
        spark_df = spark.createDataFrame(pandas_df)
        spark_df.write.saveAsTable('tablename', mode='append')
        mlflow.pyfunc.log_model("somemodel", python_model=model)
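
The reason this pattern works can be illustrated without Spark or MLflow at all: log_model serializes the PythonModel instance (MLflow uses cloudpickle under the hood), and any unserializable handle the instance drags along, such as a SparkSession or SparkContext, makes the dump fail. Here is a rough stand-alone sketch of that idea, using a threading.Lock as a stand-in for the unpicklable Spark handle (the class names are illustrative, not MLflow APIs):

```python
import pickle
import threading

class ModelHoldingHandle:
    """Analogue of a model whose predict() depends on a Spark handle."""
    def __init__(self):
        # Stand-in for a SparkSession/SparkContext: not serializable
        self.spark = threading.Lock()

class PlainModel:
    """Analogue of the fixed model: pure Python only, Spark I/O in the caller."""
    def predict(self, context, model_input):
        return [x * 2 for x in model_input]

# Serializing the handle-holding model fails, just like log_model does
try:
    pickle.dumps(ModelHoldingHandle())
    holding_ok = True
except TypeError:
    holding_ok = False  # cannot pickle the embedded handle

# The Spark-free model serializes without issue
plain_bytes = pickle.dumps(PlainModel())
```

Keeping predict free of Spark references and doing the table write in the driver code is what makes the model serializable, and therefore loggable.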
S J
  • This doesn't work if we use model serving. It will work in a notebook, but not if we use the model as an endpoint – ghostiek Nov 16 '22 at 19:04