The workflow:
- We preprocess our raw data with PySpark; Spark is required because of the size of the data.
- The PySpark preprocessing job uses a pipeline model, which lets you export the preprocessing logic to a file.
- By exporting the preprocessing logic as a pipeline model, you can load that same model at inference time, so you don't have to implement your preprocessing logic twice.
- At inference time, we would prefer to run the preprocessing step without a Spark context. Starting a Spark context is unnecessary overhead at inference time and adds latency to each prediction.
I was looking at MLeap, but as far as I can tell it only supports Scala for inference without a Spark context. Since we use PySpark, it would be nice to stay in Python.
Question: What is a good alternative that lets you build a pipeline model in (Py)Spark during the training phase and then reuse that pipeline model from Python at inference time, without needing a Spark context?