
I'm currently working in Python, building a moderately complex application that relies on stateful data from multiple sources. With PySpark I've run into an issue where a global variable used within an updateStateByKey function isn't being reassigned after the application restarts from a checkpoint. Using foreachRDD, I have a global variable A that is populated from a file every time a batch runs, and A is then used inside an updateStateByKey function. When I initially run the application it works as expected, and the value of A is referenced correctly within the scope of the update function.

However, when I bring the application down and restart it, I see different behavior. Variable A is assigned the correct value by its corresponding foreachRDD function, but when the updateStateByKey function is executed the new value of A isn't used. It just... disappears.

I could be going about this implementation the wrong way, but I'm hoping someone can point me in the right direction.

Here's some pseudocode:

def readfile(rdd):
    # Runs on the driver for every batch; re-reads the file into the global
    global A
    A = readFromFile()  # placeholder for the actual file read

def update(new, old):
    # State update that references the global A
    if old in A:
        # do something -- placeholder state logic
        return old
    return new

dstream.foreachRDD(readfile)
dstream.updateStateByKey(update)

ssc.checkpoint('checkpoint')
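
For context, here is a minimal, self-contained sketch of how a driver like this would typically be created and then recovered from the checkpoint via StreamingContext.getOrCreate; the socket source, batch interval, lookup file, and helper names below are placeholders I've assumed, not the actual application:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = 'checkpoint'

A = set()  # global lookup data, refreshed every batch by read_lookup

def read_lookup(rdd):
    # Runs on the driver once per batch and re-reads the lookup file
    global A
    with open('lookup.txt') as f:  # placeholder file
        A = set(line.strip() for line in f)

def update(new, old):
    # Placeholder state logic that references the global A
    if old in A:
        return old
    return new

def create_context():
    # Only called on a clean start; after a restart the context,
    # including the DStream graph and the update function, is
    # deserialized from the checkpoint instead.
    sc = SparkContext(appName='stateful-app')
    ssc = StreamingContext(sc, 10)  # placeholder batch interval
    ssc.checkpoint(CHECKPOINT_DIR)

    lines = ssc.socketTextStream('localhost', 9999)  # placeholder source
    pairs = lines.map(lambda line: (line, 1))

    pairs.foreachRDD(read_lookup)
    pairs.updateStateByKey(update).pprint()
    return ssc

ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()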

A is correct the first time this runs, but when the application is killed and restarted, A doesn't seem to be reassigned correctly inside the update function.

