
How can I share variables across Spark streaming batches in PySpark?

I'm trying to share a DataFrame that holds various values for combinations of features, such as platform.

The program works once, when the global variable is first initialized. It then crashes when the global variable is accessed by Spark again in the next batch.

Here is my partial code:

# .. initialize
from pyspark.streaming import StreamingContext

globalDataFrame = None
sqlCountsDataFrame = None
stream = StreamingContext(sc, 30)  # 30 second batch window
# .. connect to stream
# .. get data from stream

def process(time, rdd):
    global globalDataFrame
    spark = sqlContext
    # .. process rdd into rowRdd
    sqlDataFrame = spark.createDataFrame(rowRdd)
    sqlCountsDataFrame = spark.sql(...)
    if globalDataFrame is not None:
        globalDataFrame = globalDataFrame.unionAll(sqlCountsDataFrame)
    else:
        print "initializing globalDataFrame"
        globalDataFrame = sqlCountsDataFrame
    print globalDataFrame.count()

parsed_data.foreachRDD(process)
stream.start()
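
For reference, one pattern I've seen suggested for this situation is the lazily-instantiated singleton SQLContext from the Spark Streaming programming guide, so that the context used inside foreachRDD is always valid rather than captured once in a closure. Below is a minimal sketch of what that would look like in my process function; getSqlContextInstance follows the guide, and the body of process is abbreviated to the lines that change:

    from pyspark.sql import SQLContext

    def getSqlContextInstance(sparkContext):
        # Lazily create a single SQLContext on the driver and reuse it
        if 'sqlContextSingletonInstance' not in globals():
            globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
        return globals()['sqlContextSingletonInstance']

    def process(time, rdd):
        global globalDataFrame
        spark = getSqlContextInstance(rdd.context)  # instead of a captured sqlContext
        # .. same body as above: build sqlCountsDataFrame, then union or initialize

I'm not certain this is my root cause, but note also that unioning a new batch into globalDataFrame every 30 seconds builds an ever-longer lineage, so even with a valid context the accumulated frame becomes increasingly expensive to recompute; caching it after each union (globalDataFrame.cache()) is a cheap mitigation.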