I want to build a custom PipelineModel transformer in PySpark that is based on two pre-trained models (both PipelineModel objects). The objective is to feed data to model1 and model2, get the class-1 probability from each model, multiply each class-1 probability by the corresponding flag (i.e. is_model1 or is_model2), and finally return all columns as output.
This is what I have implemented so far:
from pyspark.ml import PipelineModel
from pyspark.ml.feature import Interaction, VectorAssembler, VectorSlicer

# model1 and model2 are the pre-trained PipelineModel objects.
# Rename each classifier's probability column so the two do not collide.
model1.stages[-1].setProbabilityCol("model1_probabilities")
model2.stages[-1].setProbabilityCol("model2_probabilities")

# Keep only the class-1 probability (index 1) from each probability vector.
model1_slicer = VectorSlicer(inputCol="model1_probabilities", outputCol="model1_class1_probability", indices=[1])
model2_slicer = VectorSlicer(inputCol="model2_probabilities", outputCol="model2_class1_probability", indices=[1])

# Assemble each flag with its class-1 probability into a single vector.
model1_interaction_assembler = VectorAssembler(inputCols=["is_model1", "model1_class1_probability"], outputCol="interaction_features_for_model1")
model2_interaction_assembler = VectorAssembler(inputCols=["is_model2", "model2_class1_probability"], outputCol="interaction_features_for_model2")

# Intended to multiply the flag by the probability.
model1_interaction = Interaction(inputCols=["interaction_features_for_model1"], outputCol="model1_score")
model2_interaction = Interaction(inputCols=["interaction_features_for_model2"], outputCol="model2_score")

my_transformer = PipelineModel(
    stages=[
        model1,
        model2,
        model1_slicer,
        model2_slicer,
        model1_interaction_assembler,
        model2_interaction_assembler,
        model1_interaction,
        model2_interaction,
    ]
)
A caveat: I cannot use custom transformers or SQL-based transformers, because I want to serialize the model using MLeap (here is the list of transformers supported by MLeap: https://combust.github.io/mleap-docs/core-concepts/transformers/support.html).
The problem I am facing is that VectorSlicer returns a single-element vector, so when I pass the assembled column to Interaction I get back a vector containing the flag and the probability instead of their product. The output looks something like this:
+-------------------------+
|model1_score|model2_score|
+-------------------------+
|[0.0,0.4404]|[1.0,0.5022]|
|[1.0,0.6686]|[0.0,0.7566]|
|[1.0,0.4676]|[0.0,0.4660]|
|[1.0,0.7589]|[0.0,0.7492]|
|[0.0,0.4275]|[1.0,0.5513]|
|[0.0,0.3982]|[1.0,0.6714]|
+-------------------------+
whereas the expected output here would be:
+-------------------------+
|model1_score|model2_score|
+-------------------------+
|0.0         |0.5022      |
|0.6686      |0.0         |
|0.4676      |0.0         |
|0.7589      |0.0         |
|0.0         |0.5513      |
|0.0         |0.6714      |
+-------------------------+
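To make the difference between the two outputs above concrete, here is a minimal plain-Python sketch of how Interaction appears to behave, based on its documented description (it emits the product of every combination of one value from each input column). The interaction helper is hypothetical and only models a single row; it is not the Spark implementation and has not been checked against MLeap. With one assembled input column there is nothing to multiply against, so the vector passes through unchanged, which matches the output I am seeing. Passing the flag and the sliced probability as two separate inputs would yield the product, although still wrapped in a one-element vector rather than a scalar.

```python
from itertools import product
from math import prod


def interaction(*columns):
    """Hypothetical stand-in for Interaction applied to one row.

    Each argument is one input column's value for that row: either a
    number or a list of numbers. The result holds the product of every
    combination of one value taken from each input column.
    """
    as_vectors = [c if isinstance(c, (list, tuple)) else [c] for c in columns]
    return [float(prod(combo)) for combo in product(*as_vectors)]


# One assembled column [flag, probability]: no second input to multiply with,
# so the values come back unchanged.
single_input = interaction([1.0, 0.6686])

# Flag and sliced probability as two separate inputs: a single product.
two_inputs = interaction(1.0, [0.6686])
flag_off = interaction(0.0, [0.4404])

print(single_input)
print(two_inputs)
print(flag_off)
```

If this model is right, dropping the VectorAssembler stages and giving Interaction inputCols=["is_model1", "model1_class1_probability"] directly might produce the multiplied value, but that is an assumption I have not verified in Spark.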