
I'm trying to write a few custom classes to work with the existing MLlib codebase and MLflow on Databricks. For example, I want to write a transformer or estimator, or extend an existing MLlib class, and be able to add it to a pipeline, fit it (if necessary), log it to MLflow, and serve it.

Does anyone have experience writing custom MLlib classes that can be used with MLflow, and could you help me?

I can create transformers, estimators, and so on, and even extend MLlib classes, but I always get the following warning when logging the model to an MLflow run:

 2023/02/27 02:37:31 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpjpcb462k, flavor: spark), fall back to return ['pyspark==3.3.0']. Set logging level to DEBUG to see the full traceback.
 Out[7]: <mlflow.models.model.ModelInfo at 0x7f5ea18f9610>
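
For reference, the full traceback behind that warning can be surfaced by raising the mlflow logger to DEBUG, as the message suggests. A minimal sketch using standard Python logging (it assumes the usual "mlflow" top-level logger namespace):

import logging
logging.getLogger("mlflow").setLevel(logging.DEBUG)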

It still works fine if I load the model from a notebook after doing a `%run` of the notebook that defines the custom classes, but it won't work if I serve the model and call the REST endpoint, for example.
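
To be concrete, loading works roughly like this in a notebook (the `%run` target and the run ID below are placeholders, not real paths):

# In a Databricks cell, bring the class definitions into scope first:
#   %run ./notebook_with_custom_classes
import mlflow
loaded = mlflow.spark.load_model("runs:/<run_id>/model")  # <run_id>: placeholder
# loaded.transform(some_df) then works as expected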

Is there an "official" way of doing this? Can somebody please help me work this out?

Cheers,
Mateus

Toy example:

import mlflow
from pyspark import keyword_only
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import round as spark_round

class ValueRounder(Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable):
  """Transformer that rounds the values of inputCol into a new outputCol."""

  @keyword_only
  def __init__(self, inputCol=None, outputCol=None):
    super(ValueRounder, self).__init__()
    kwargs = self._input_kwargs
    self.setParams(**kwargs)

  @keyword_only
  def setParams(self, inputCol=None, outputCol=None):
    kwargs = self._input_kwargs
    return self._set(**kwargs)

  def setInputCol(self, value):
    return self._set(inputCol=value)

  def setOutputCol(self, value):
    return self._set(outputCol=value)

  def _transform(self, dataset):
    # round() accepts a column name, so getInputCol() can be passed directly
    return dataset.withColumn(self.getOutputCol(), spark_round(self.getInputCol()))

df = spark.createDataFrame(
    [
        (1.0, 2, None), 
        (1.2, None, 3), 
        (None, 2, None), 
        (1.5, None, 3), 
        (1.7, 2, 3)
    ], 
    ['A', 'B', 'C'] 
)
myAssembler = VectorAssembler(inputCols=['A', 'B', 'C'], outputCol='features', handleInvalid='keep')
myRounder = ValueRounder(inputCol='A', outputCol='rounded(A)')
model = Pipeline(stages=[myAssembler, myRounder]).fit(df)
# the following spits out the aforementioned warning
mlflow.spark.log_model(model, artifact_path='model')
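
For completeness, a variation of the logging call that is sometimes suggested for serving custom classes: pin pip_requirements explicitly (skipping the inference step that emits the warning) and ship the class in a plain module via code_paths. This is only a sketch, not confirmed as the official way; value_rounder.py is a hypothetical file defining ValueRounder, and the code_paths/pip_requirements arguments assume a reasonably recent MLflow.

# hypothetical variant: ValueRounder defined in value_rounder.py so the
# serving container can import it without a %run
mlflow.spark.log_model(
    model,
    artifact_path='model',
    code_paths=['value_rounder.py'],      # hypothetical module with the class
    pip_requirements=['pyspark==3.3.0'],  # pinned explicitly, no inference
)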