How can one train (fit) a model on a distributed big data platform (e.g. Apache Spark) and then use that model on a standalone machine (e.g. a plain JVM) with as few dependencies as possible?
I have heard of PMML, but I am not sure whether it is sufficient. Spark 2.0 also supports persisting fitted models, but I am not sure what is required to load and run those saved models.