How to use pipeline for Apache Spark jobs?

Question

I’m learning how to use kubeflow pipeline for Apache Spark jobs and have a question. I’d appreciate if you could share your thoughts!

It is my understanding that data cannot be shared between SparkSessions, and that in each pipeline step/component you need to instantiate a new SparkSession (please correct me if I’m wrong). Does that mean in order to use output from spark jobs from previous pipeline steps, we need to save it somewhere? I suspect this will cause disk read/write burden and slow down the whole process. Can you please share with me how helpful it will be then to use pipeline for spark work?

I’m imaging a potential use case where one would like to ingest data in pyspark, preprocess it, select features for a ML job, then try different ML models and select the best one. In a non-spark situation, I probably would set separate components for each step of “loading data”, “preprocessing data”, and “feature engineering”. Due to the aforementioned issue, however, would it be better to complete all these with in one step in the pipeline, save the output somewhere, and then dedicate separate pipeline components for each model and train them in parallel?

Can you share any other potential use case? Thanks a lot in advance!

score 0 · Answer 1 · answered Mar 23 '22 at 09:36

Spark in general is a in-memory processing framework, you'd avoid un-necessary writing/reading files. I believe it's better to have one spark job done in one task so you don't need to share spark session and the "middle" result between tasks. Data from loading data/pre-processing/feature engineering better to be serialised/stored with/without kubeflow anyway (think silver/bronze/golden).

How to use pipeline for Apache Spark jobs?

1 Answers1