How to combine scores from two models in PySpark to create a new PipelineModel that is compatible with MLeap?

Question

I want to build a custom PipelineModel in PySpark from two pre-trained models (both are PipelineModel objects). The goal is to feed the data to model1 and model2, take the class-1 probability from each model, multiply each probability by its corresponding flag column (is_model1 or is_model2), and return all columns as output.

This is what I have implemented so far:

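from pyspark.ml import PipelineModel
from pyspark.ml.feature import Interaction, VectorAssembler, VectorSlicer

# model1 and model2 are the pre-trained PipelineModel objects; the last stage of each is the classifier.
# Rename the probability columns so the two models do not write to the same column.
model1.stages[-1].setProbabilityCol("model1_probabilities")
model2.stages[-1].setProbabilityCol("model2_probabilities")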

model1_slicer = VectorSlicer(inputCol="model1_probabilities", outputCol="model1_class1_probability", indices=[1])
model2_slicer = VectorSlicer(inputCol="model2_probabilities", outputCol="model2_class1_probability", indices=[1])

model1_interaction_assembler = VectorAssembler(inputCols=["is_model1", "model1_class1_probability"], outputCol="interaction_features_for_model1")
model2_interaction_assembler = VectorAssembler(inputCols=["is_model2", "model2_class1_probability"], outputCol="interaction_features_for_model2")

model1_interaction = Interaction(inputCols=["interaction_features_for_model1"], outputCol="model1_score")
model2_interaction = Interaction(inputCols=["interaction_features_for_model2"], outputCol="model2_score")

my_transformer = PipelineModel(
    stages=[
        model1,
        model2,
        model1_slicer,
        model2_slicer,
        model1_interaction_assembler,
        model2_interaction_assembler,
        model1_interaction,
        model2_interaction,
    ]
)

One caveat: I cannot use custom transformers or SQL-based transformers, because I want to serialize the model with MLeap (here is the list of transformers supported by MLeap: https://combust.github.io/mleap-docs/core-concepts/transformers/support.html).

The problem I am facing is that VectorSlicer returns a single-element vector, so when I assemble it with the flag and pass the result to Interaction, I get back a two-element vector instead of the product of the flag and the probability score. The output looks something like this:

+-------------------------+
|model1_score|model2_score|
+-------------------------+
|[0.0,0.4404]|[1.0,0.5022]|
|[1.0,0.6686]|[0.0,0.7566]|
|[1.0,0.4676]|[0.0,0.4660]|
|[1.0,0.7589]|[0.0,0.7492]|
|[0.0,0.4275]|[1.0,0.5513]|
|[0.0,0.3982]|[1.0,0.6714]|
+-------------------------+

whereas the expected output here would be:

+-------------------------+
|model1_score|model2_score|
+-------------------------+
|0.0         |0.5022      |
|0.6686      |0.0         |
|0.4676      |0.0         |
|0.7589      |0.0         |
|0.0         |0.5513      |
|0.0         |0.6714      |
+-------------------------+
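
To make the gap between the two outputs above concrete, here is a minimal pure-Python sketch of how Spark documents Interaction: it takes one value from each input column and emits the product of every such combination. The helper name `interaction` is hypothetical and only models a single row; it is not part of PySpark or MLeap. With a single assembled input column there is nothing to multiply against, so the vector comes back unchanged, which matches the observed output. Passing the flag and the sliced probability as two separate input columns would, under this model, yield the product, though still wrapped in a one-element vector rather than a plain double.

```python
from itertools import product
from math import prod


def interaction(*columns):
    """Hypothetical single-row model of Spark's Interaction semantics.

    Each argument is the value of one input column, given as a list of floats
    (a numeric column is a one-element list). The result contains the product
    of every combination of one value taken from each column.
    """
    return [prod(combo) for combo in product(*columns)]


# As in the question: one input column that already holds [flag, probability].
# There is no second column to multiply with, so the vector is returned as is.
single_input = interaction([0.0, 0.4404])
print(single_input)  # [0.0, 0.4404]

# Assumption: flag and sliced probability are passed as two separate columns,
# e.g. inputCols=["is_model1", "model1_class1_probability"].
two_inputs = interaction([1.0], [0.6686])
print(two_inputs)  # [0.6686]
```

If this model holds, the two VectorAssembler stages might be unnecessary, and each Interaction stage could take the flag column and the VectorSlicer output directly. Whether the resulting one-element vector is acceptable, and whether that configuration serializes cleanly with MLeap, is not verified here.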

How can I combine scores from two models in PySpark to create a new PipelineModel that is compatible with MLeap?
