I am new to PySpark and I'm trying to implement a multi-step EMR/Spark job using MRJob. Do I need to create a new SparkContext for each SparkStep, or can I share the same SparkContext across all SparkSteps?
I tried looking this up in the MRJob documentation, but unfortunately it isn't clear on this point.
Can someone please advise on the correct approach?
Creating a separate SparkContext for each SparkStep:
from mrjob.job import MRJob
from mrjob.step import SparkStep


class MRSparkJob(MRJob):

    def spark_step1(self, input_path, output_path):
        from pyspark import SparkContext
        # create a fresh context for this step and tear it down at the end
        sc = SparkContext(appName='appname')
        ...
        sc.stop()

    def spark_step2(self, input_path, output_path):
        from pyspark import SparkContext
        # same pattern: new context, then stop it
        sc = SparkContext(appName='appname')
        ...
        sc.stop()

    def steps(self):
        return [SparkStep(spark=self.spark_step1),
                SparkStep(spark=self.spark_step2)]


if __name__ == '__main__':
    MRSparkJob.run()
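In both variants I'm launching the job the standard MRJob way (the script name and S3 paths below are just placeholders):

python mr_spark_job.py -r emr s3://my-bucket/input/ --output-dir s3://my-bucket/output/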
Creating a single SparkContext and sharing it among different SparkSteps:
from mrjob.job import MRJob
from mrjob.step import SparkStep


class MRSparkJob(MRJob):

    sc = None

    def spark_step1(self, input_path, output_path):
        from pyspark import SparkContext
        # create the context once and keep it on the job instance
        self.sc = SparkContext(appName='appname')
        ...

    def spark_step2(self, input_path, output_path):
        # reuse the same self.sc created in spark_step1
        ...
        self.sc.stop()

    def steps(self):
        return [SparkStep(spark=self.spark_step1),
                SparkStep(spark=self.spark_step2)]


if __name__ == '__main__':
    MRSparkJob.run()
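As an aside: would PySpark's SparkContext.getOrCreate() be relevant here? As I understand it, it returns the already-active SparkContext if one exists and only creates a new one otherwise, so a (possibly naive) sketch of the shared-context approach could look like this:

from mrjob.job import MRJob
from mrjob.step import SparkStep


class MRSparkJob(MRJob):

    def spark_step1(self, input_path, output_path):
        from pyspark import SparkContext
        # returns the active context if one exists, otherwise creates one
        sc = SparkContext.getOrCreate()
        ...

    def spark_step2(self, input_path, output_path):
        from pyspark import SparkContext
        # if both steps run in the same process, this should pick up
        # the context created in spark_step1
        sc = SparkContext.getOrCreate()
        ...
        sc.stop()

    def steps(self):
        return [SparkStep(spark=self.spark_step1),
                SparkStep(spark=self.spark_step2)]


if __name__ == '__main__':
    MRSparkJob.run()

But I don't know whether the two SparkSteps even run in the same Python process, which I suppose is really the crux of my question.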