
I have a DataFrame generated from Spark that I want to use with writeStream and also save to a database.

I have the following code:

output = (
    spark_event_df
    .writeStream
    .outputMode('update')
    .foreach(writerClass(**job_config_data))
    .trigger(processingTime="2 seconds")
    .start()
)
output.awaitTermination()

Since I am using foreach(), writerClass receives a Row, and I cannot convert it into a dictionary in Python.

How can I get a Python data type (preferably a dictionary) from the Row in my writerClass, so that I can manipulate it as needed and save it to the database?


1 Answer


If you're just looking to save to a database as part of your stream, you could do that using foreachBatch and the built-in JDBC writer. Just do your transformations to shape your data according to the desired output schema, then:

def writeBatch(batch_df, batch_id):
  # url and tbl are assumed to be defined in your job configuration;
  # add .option("user", ...) / .option("password", ...) if your
  # database requires credentials
  (batch_df
    .write
    .format("jdbc")
    .option("url", url)
    .option("dbtable", tbl)
    .mode("append")
    .save())

output = (spark_event_df
           .writeStream
           .foreachBatch(writeBatch)
           .start())
output.awaitTermination()

If you absolutely need custom write logic that the built-in JDBC writer does not support, then use the DataFrame foreachPartition method (on the batch DataFrame passed to foreachBatch) to write your rows in bulk rather than one at a time. With that approach, you can convert each Row object into a dict by calling its asDict() method.
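
For example, here is a minimal sketch of that approach (replacing the JDBC writeBatch above), assuming a hypothetical db_connect() helper that returns a DB-API-style connection (psycopg2, for instance) and a hypothetical events table with id and payload columns; each partition opens one connection and inserts all of its rows in a single executemany round trip:

def writePartition(rows):
    # rows is an iterator of Row objects for one partition
    conn = db_connect()  # assumed helper returning a DB-API connection
    records = [row.asDict() for row in rows]  # Row -> plain Python dict
    if records:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO events (id, payload) VALUES (%(id)s, %(payload)s)",
                records,
            )
        conn.commit()
    conn.close()

def writeBatch(batch_df, batch_id):
    # run the bulk writer once per partition of this micro-batch
    batch_df.foreachPartition(writePartition)

output = (spark_event_df
           .writeStream
           .foreachBatch(writeBatch)
           .start())
output.awaitTermination()

Opening one connection per partition (rather than per row) is the main win here; note also that asDict() takes recursive=True if your rows contain nested structs.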
