Spark - Reload saved Featurization Pipeline vs instantiate new Pipeline with same stages

Question

I would like to check if I'm missing any important points here.

My pipeline is only for Featurization. I understand that once a pipeline that includes an Estimator is fitted, saving the pipeline will persist the params the Estimator has learned. So loading a saved pipeline in this case means not having to re-train the Estimator, which is a huge point.

However, for the case of a pipeline which consists only of Transformer stages, would I always get the same result on feature extraction from an input dataset using either of the two approaches below?

1)

Creating a pipeline with a certain set of stages, and configuration per stage.
Saving and reloading the pipeline.
Transforming an input dataset

versus

2)

Each time just instantiating a new pipeline (of course with the exact same set of stages, and configuration per stage).
Transforming the input dataset

So, an alternative phrasing would be: as long as the exact set of stages and the configuration per stage are known, can a Featurization pipeline be efficiently recreated (because there is no 'training an estimator' phase) without using save or load?

Thanks, Brent

I don't believe that loading an Estimator can be slipped into the Pipeline for the time being. — eliasah, Aug 18 '16 at 15:54
Just for the record, are you asking about `Pipeline` or `PipelineModel`? — zero323, Aug 18 '16 at 16:15
