
I am working on a use case where I have to process a huge amount of data (multiple tables), and I am trying to submit this as a batch PySpark job to a Dataproc cluster.

My code looks something like this:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession

def readconfig():
    # code to read a yaml file mapping each input file to a BigQuery table
    ...

def func(filename, tabname):
    sc = SparkContext("local", "First App")
    sqlContext = SQLContext(sc)
    spark = SparkSession.builder.getOrCreate()
    df1 = ...  # read from file `filename` as an RDD using sqlContext
    df2 = ...  # read from BigQuery table `tabname` as a DataFrame using spark
    op = ...   # intermediate processing
    # caching and unpersisting the 2 DataFrames
    op.write.csv(...)  # write multiple files to a GCS bucket
    sc.stop()
    spark.stop()
    print("one pair of table and file processed")

if __name__ == "__main__":
    config = readconfig()
    for i, j in config.items():
        func(i, j)

As the file sizes are huge, I am trying to create a separate SparkSession for each pair of file and table being processed. It works fine and I was able to process a good number of tables. Later I started receiving warnings about memory issues on the node, and finally an error saying:

node has insufficient resources. Could not create SparkSession.

Why is this happening, when closing the SparkSession should release the memory held by the data from the previous iteration?


1 Answer


Because you are passing a local value as the master parameter to the SparkContext constructor, you are running your application in local deployment mode on a single VM (the Dataproc master node). That's why you cannot process a large amount of data in your application.

To fix this issue you should use the parameterless SparkContext() constructor, which will load the parameters from the properties configured by Dataproc - in this case your application will run on YARN when you submit it to the Dataproc cluster and will be able to utilize all of the cluster's resources/nodes.
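For example, a minimal sketch of the session setup (keeping your object names; Dataproc supplies spark.master and the other properties at submit time):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession

sc = SparkContext()            # no "local" master: spark.master comes from the cluster config (YARN on Dataproc)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.appName("First App").getOrCreate()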

Also, you may want to refactor your application to do the data processing for all tables in a single SparkSession instead of creating a per-table SparkSession - this should be more efficient and scalable if done right.
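A rough sketch of that refactoring (the yaml config path, the spark-bigquery connector read, and the gs:// output path are assumptions based on your description; the intermediate processing is left as a placeholder):

import yaml
from pyspark.sql import SparkSession

def readconfig(path="config.yaml"):                      # path is hypothetical
    with open(path) as f:
        return yaml.safe_load(f)                         # expects a {filename: tabname} mapping

def process_pair(spark, filename, tabname):
    df1 = spark.read.csv(filename, header=True)          # adjust options to your file format
    df2 = (spark.read.format("bigquery")                 # spark-bigquery connector available on Dataproc
                .option("table", tabname)
                .load())
    op = df1                                             # placeholder: replace with your processing of df1/df2
    op.write.mode("overwrite").csv("gs://your-bucket/output/" + tabname)   # hypothetical bucket
    print("one pair of table and file processed")

if __name__ == "__main__":
    config = readconfig()
    # one SparkSession (and one YARN application) for all pairs
    spark = SparkSession.builder.appName("First App").getOrCreate()
    for filename, tabname in config.items():
        process_pair(spark, filename, tabname)
    spark.stop()

With one long-lived session, YARN can reuse the executors across pairs instead of paying the startup cost of a new application on every iteration.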
