
Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.

AFAIK, unfortunately, there's no way to do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions (roughly as sketched below).
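For reference, the kind of per-partition upsert I'm aiming for looks roughly like this; pymysql, the connection parameters, and the target(id, val) table are all placeholder assumptions, not settled choices:

import pymysql  # placeholder driver choice

def upsert_partition(rows):
    # One connection per partition; the connection parameters and the
    # target(id, val) table below are hypothetical placeholders.
    conn = pymysql.connect(host="HOST", user="USER", password="PASSWORD", database="DB")
    try:
        with conn.cursor() as cur:
            for row in rows:
                # MySQL upsert: insert, or update the existing row on key collision
                cur.execute(
                    "INSERT INTO target (id, val) VALUES (%s, %s) "
                    "ON DUPLICATE KEY UPDATE val = VALUES(val)",
                    (row["id"], row["val"]),
                )
        conn.commit()
    finally:
        conn.close()

# df would be the DataFrame read from the source MySQL table
df.rdd.foreachPartition(upsert_partition)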

For now I'm writing a script and testing it with the local gluepyspark shell, Spark version 3.1.1-amzn-0.

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

def f(p):
    pass  # intentionally a no-op for this minimal repro

sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))  # this call triggers the error described below

When I try to import this simple code in the gluepyspark shell, it raises an error saying "SparkContext should only be created and accessed on the driver."
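To be precise, by "import" I mean saving the code above as a module and importing it from the shell prompt, e.g. (the module name is just an example):

import my_glue_job  # hypothetical my_glue_job.py containing the code above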

However, there are some conditions under which it works.

  • It works if I run the script via gluesparksubmit.
  • It works if I use lambda expression instead of function declaration.
  • It works if I declare a function within REPL and pass it as argument.
  • It does not work if I put both the def func(): () declaration and the .foreachPartition(func) call in the same script.
  • Moving the function declaration to another module also seems to work (sketched after this list). But this isn't an option, since I need to pack everything into one job script.
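
Concretely, the two-module variant that works looks like this (helpers.py is just an example name):

# helpers.py (example module name)
def f(p):
    pass

# job script
from pyspark.context import SparkContext
from helpers import f

sc = SparkContext.getOrCreate()
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(f)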

Could anyone please help me understand:

  • why the error is thrown
  • why the error is NOT thrown in the other cases

Complete error log: https://justpaste.it/37tj6

fracsinus
