
The workflow:

  • We preprocess our raw data with PySpark. We need Spark because of the size of the data.
  • The PySpark preprocessing job uses a pipeline model, which lets you export your preprocessing logic to a file.
  • Because the preprocessing logic is exported as a pipeline model, you can load that model at inference time, so you don't have to code your preprocessing logic twice.
  • At inference time, we would prefer to do the preprocessing step without a Spark context: the Spark context is redundant there and slows down inference.

I was looking at MLeap, but it only supports Scala for doing inference without a Spark context. Since we use PySpark, it would be nice to stick to Python.

Question: What is a good alternative that lets you build a pipeline model in (Py)Spark during the training phase and reuse that pipeline model from Python, without the need for a Spark context?

Vincent Claes
