
I am new to PySpark and I'm trying to implement a multi-step EMR/Spark job using MRJob. Do I need to create a new SparkContext for each SparkStep, or can I share the same SparkContext across all SparkSteps?

I tried to look this up in the MRJob documentation, but unfortunately it was not clear on this point.

Can someone please advise which of the following is the correct approach?

  1. Create a separate SparkContext for each SparkStep:

    from mrjob.job import MRJob
    from mrjob.step import SparkStep

    class MRSparkJob(MRJob):
        def spark_step1(self, input_path, output_path):
            from pyspark import SparkContext
            sc = SparkContext(appName='appname')
            ...
            sc.stop()
    
        def spark_step2(self, input_path, output_path):
            from pyspark import SparkContext
            sc = SparkContext(appName='appname')
            ...
            sc.stop()
    
        def steps(self):
            return [SparkStep(spark=self.spark_step1),
                    SparkStep(spark=self.spark_step2)]
    
    if __name__ == '__main__':
        MRSparkJob.run()
    
  2. Create a single SparkContext and share it among different SparkSteps:

    from mrjob.job import MRJob
    from mrjob.step import SparkStep

    class MRSparkJob(MRJob):
    
        sc = None
    
        def spark_step1(self, input_path, output_path):
            from pyspark import SparkContext
            self.sc = SparkContext(appName='appname')
            ...
    
    
        def spark_step2(self, input_path, output_path):
            # Reuse the same self.sc that was created in spark_step1.
            ...
            self.sc.stop()
    
        def steps(self):
            return [SparkStep(spark=self.spark_step1),
                    SparkStep(spark=self.spark_step2)]
    
    if __name__ == '__main__':
        MRSparkJob.run()
    
vkc

1 Answer


According to Dave at the MRJob discussion group, we should create a new SparkContext for each step, since each step is a completely new invocation of Hadoop and Spark (i.e. #1 above is the correct approach).

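For reference, here is a minimal sketch of approach #1, where each step creates and stops its own SparkContext. The word-count logic inside each step is purely illustrative (it is not from the original question); only the structure (one context per SparkStep) is the point.

    from mrjob.job import MRJob
    from mrjob.step import SparkStep


    class MRSparkWordCount(MRJob):

        def spark_step1(self, input_path, output_path):
            from pyspark import SparkContext
            sc = SparkContext(appName='step1')
            # Split lines into words and write intermediate output for step 2.
            (sc.textFile(input_path)
               .flatMap(lambda line: line.split())
               .saveAsTextFile(output_path))
            sc.stop()

        def spark_step2(self, input_path, output_path):
            from pyspark import SparkContext
            # A fresh context, since this step runs as a separate invocation.
            sc = SparkContext(appName='step2')
            # Count occurrences of each word produced by step 1.
            (sc.textFile(input_path)
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .saveAsTextFile(output_path))
            sc.stop()

        def steps(self):
            return [SparkStep(spark=self.spark_step1),
                    SparkStep(spark=self.spark_step2)]


    if __name__ == '__main__':
        MRSparkWordCount.run()

You then run it like any other MRJob, e.g. with the EMR runner (`-r emr`).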
vkc