Suppose I have many feature-engineering steps, i.e. many transformers in my pipeline. I am wondering how Spark handles these transformers during cross-validation of the pipeline: are they executed for each fold? Would it be faster to apply the transformers before cross-validating the model?
Which of these workflows would be fastest (or is there a better solution)?
1. Cross validator on pipeline
transformer1 = ...
transformer2 = ...
transformer3 = ...
lr = LogisticRegression(...)
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3, lr])
crossval = CrossValidator(estimator=pipeline, numFolds=10, ...)
cvModel = crossval.fit(training)
prediction = cvModel.transform(test)
2. Cross validator after pipeline
transformer1 = ...
transformer2 = ...
transformer3 = ...
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3])
training_trans = pipeline.fit(training).transform(training)
lr = LogisticRegression(...)
crossval = CrossValidator(estimator=lr, numFolds=10, ...)
cvModel = crossval.fit(training_trans)
prediction = cvModel.transform(test)
Finally, I have the same question about caching: in 2. I could cache training_trans before running the cross-validation, and in 1. I could insert a Cacher transformer into the pipeline just before the LogisticRegression (see Caching intermediate results in Spark ML pipeline for the Cacher).