I'm currently working in Python, building a moderately complex application that relies on stateful data from multiple sources. With PySpark I've run into an issue where a global variable used within an `updateStateByKey` function isn't being reassigned after the application restarts from a checkpoint. Using `foreachRDD`, I have a global variable `A` that is populated from a file every time a batch runs, and `A` is then used in an `updateStateByKey` function. When I initially run the application, it works as expected and the value of `A` is referenced correctly within the scope of the update function.
However, when I bring the application down and restart it, I see different behavior. Variable `A` is assigned the correct value by its corresponding `foreachRDD` function, but when the `updateStateByKey` function is executed, the new value of `A` isn't used. It just... disappears.
I could be going about the implementation of this wrong, but I'm hoping that someone can point me in the correct direction.
Here's some pseudocode:
```python
def readfile(rdd):
    global A
    A = readFromFile()  # placeholder: re-reads the reference data each batch

def update(new, old):
    if old in A:
        # do something
        pass

dstream.foreachRDD(readfile)
dstream.updateStateByKey(update)
ssc.checkpoint('checkpoint')
```
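For completeness, here's a fuller sketch of my setup. The file path, socket source, and state logic are simplified stand-ins for my real code, and the restart path uses the standard `StreamingContext.getOrCreate` checkpoint pattern:

```python
# Sketch only: 'reference.txt', the socket source, and the update logic
# are illustrative stand-ins for the real application.
A = set()  # reference data, refreshed from a file before each batch

def read_reference(rdd):
    """Runs on the driver each batch via foreachRDD; refreshes the global A."""
    global A
    with open('reference.txt') as f:
        A = {line.strip() for line in f}

def update(new_values, old_state):
    # Simplified version of the "do something" branch: only carry the
    # previous state forward when it appears in the reference data A.
    if old_state in A:
        return (old_state or 0) + sum(new_values)
    return sum(new_values)

def create_context():
    # Imported here so the module can be loaded without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName='stateful-app')
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint('checkpoint')

    pairs = ssc.socketTextStream('localhost', 9999).map(lambda w: (w, 1))
    pairs.foreachRDD(read_reference)
    pairs.updateStateByKey(update).pprint()
    return ssc

if __name__ == '__main__':
    from pyspark.streaming import StreamingContext
    # Cold start builds a fresh context; after a restart it is instead
    # restored from the checkpoint directory -- where the problem shows up.
    ssc = StreamingContext.getOrCreate('checkpoint', create_context)
    ssc.start()
    ssc.awaitTermination()
```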
`A` is correct the first time this is run, but when the application is killed and restarted, `A` doesn't seem to be reassigned correctly.