
I'm getting the following error:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 473, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'psycopg2.extensions.cursor' object
PicklingError: Could not serialize object: TypeError: cannot pickle 'psycopg2.extensions.cursor' object

while running the script below:

def get_connection():
    conn_props = brConnect.value
    print(conn_props)
    #extract value from broadcast variables
    database = conn_props.get("database")
    user = conn_props.get("user")
    pwd = conn_props.get("password")
    host = conn_props.get("host") 
    db_conn = psycopg2.connect(
                host = host,
                user = user,
                password = pwd,
                database = database,
                port = 5432
                )
    return db_conn
def process_partition_up(partition, db_cur):
    updated_rows = 0
    try:
        for row in partition:
            process_row(row, myq, db_cur)
    except Exception as e:
        print("Not connected")
    return updated_rows
def update_final(df, db_cur):
    df.rdd.coalesce(2).foreachPartition(lambda x: process_partition_up(x, db_cur))
def etl_process():
    for id in ['003']:
        conn = get_connection()
        for t in ['email_table']:        
            query = f'''(select * from public.{t} where id= '{id}') as tab'''
            df_updated = load_data(query)
            if df_updated.count() > 0:
                q1 = insert_ops(df_updated, t) #assume this function returns an insert query
                query_props = q1
                sc = spark.sparkContext
                brConnectQ = sc.broadcast(query_props)
                db_conn = get_connection()
                db_cur = db_conn.cursor()
                update_final(df_updated, db_cur) 
        conn.commit()
        conn.close()

Explanation:

  • Here etl_process() internally calls get_connection(), which returns a psycopg2 connection object. It then calls update_final(), which takes a dataframe and a psycopg2 cursor object as arguments.
  • update_final() then calls process_partition_up() on each partition (df.rdd.coalesce(2).foreachPartition), passing the partition and the psycopg2 cursor object as arguments.
  • After passing the psycopg2 cursor object to process_partition_up(), I don't get the cursor object on the executor side; instead I get the error above.

Can anyone help me out to resolve this error?

Thank you.

  • so, what's happening in the code, in a gist? are you trying to update/insert some records in a db but want to take inputs from a spark dataframe? you won't be able to do it the way you're trying, as df.rdd does not return a set of rows that can be read by a db connection. do you want to `collect()` and output the rows? – samkart May 26 '22 at 09:38
  • Thanks for responding @samkart Actually I'm trying to insert some records in a db with the help of a psycopg2 cursor object. Here I'm not able to pass the psycopg2 cursor object in foreachPartition(). If I were able to pass this cursor object in foreachPartition(), it would be easy for me to insert data. **Note: It's not necessary to use psycopg2; instead you can use sqlalchemy or any other database adapter.** – Vishal Barad May 26 '22 at 09:57

1 Answer


I think that you don't understand what's happening here.

You are creating a database connection in your driver (etl_process), and then trying to ship that live connection from the driver, across your network, to an executor to do the work. (Your lambda in foreachPartition is executed on the executor.)

That is what Spark is telling you with "cannot pickle 'psycopg2.extensions.cursor'": it can't serialize your live connection to the database in order to ship it to an executor.
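You can reproduce the failure outside Spark entirely. This sketch uses an open file handle as a stand-in for the cursor: like a psycopg2 cursor, it wraps a process-local OS resource (a file descriptor), so pickle refuses to serialize it.

```python
import os
import pickle

# A file handle, like a database cursor, holds process-local state
# (a file descriptor / socket) that is meaningless in another process.
handle = open(os.devnull, "rb")
try:
    pickle.dumps(handle)
    picklable = True
except TypeError as err:
    picklable = False
    print(err)  # cannot pickle '_io.BufferedReader' object
finally:
    handle.close()
```

Spark uses cloudpickle rather than plain pickle, but it hits the same wall: objects backed by live OS resources cannot be serialized.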

You need to call conn = get_connection() from inside process_partition_up. This initializes the connection to the database from inside the executor (along with any other bookkeeping you need to do).
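A minimal sketch of that change, reusing the credential dict and insert query from the question. The names make_partition_writer, write_partition, and insert_sql are illustrative, not from the original code; the point is that the function shipped to foreachPartition closes over only plain picklable data, never a live connection or cursor.

```python
def make_partition_writer(conn_props, insert_sql):
    """Build a function that is safe to ship to executors: it captures
    only a dict and a string, both of which pickle cleanly."""
    def write_partition(partition):
        # Imported here so each executor process resolves it itself.
        import psycopg2

        # The connection is created on the executor, per partition.
        db_conn = psycopg2.connect(
            host=conn_props.get("host"),
            user=conn_props.get("user"),
            password=conn_props.get("password"),
            database=conn_props.get("database"),
            port=5432,
        )
        db_cur = db_conn.cursor()
        try:
            for row in partition:
                db_cur.execute(insert_sql, tuple(row))
            db_conn.commit()
        finally:
            db_cur.close()
            db_conn.close()  # closed locally, never shipped anywhere
    return write_partition

# Driver side (sketch):
# writer = make_partition_writer(brConnect.value, q1)
# df_updated.rdd.coalesce(2).foreachPartition(writer)
```

This trades one connection per query for one connection per partition, which is the usual cost of doing per-row writes from executors.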

FYI: the worst part, which I want to call out, is that this code will work on your local machine, because there the same process acts as both the executor and the driver.

Matt Andruff