
I have a DataFrame generated from Spark that I want to use with writeStream and also save to a database.

I have the following code:

output = (
    spark_event_df
    .writeStream
    .outputMode('update')
    .foreach(writerClass(**job_config_data))
    .trigger(processingTime="2 seconds")
    .start()
)
output.awaitTermination()

Since I am using foreach(), writerClass receives a Row, and I cannot convert it into a dictionary in Python.

How can I get a Python data type (preferably a dictionary) from the Row in my writerClass, so that I can manipulate it as needed and save it to the database?


1 Answer


If you're just looking to save to a database as part of your stream, you could do that using foreachBatch and the built-in JDBC writer. Just do your transformations to shape your data according to the desired output schema, then:

def writeBatch(batch_df, batch_id):
  # url and tbl are assumed to be defined in your job configuration;
  # add .option("user", ...) / .option("password", ...) if your
  # database requires credentials
  (batch_df
    .write
    .format("jdbc")
    .option("url", url)
    .option("dbtable", tbl)
    .mode("append")
    .save())

output = (spark_event_df
           .writeStream
           .foreachBatch(writeBatch)
           .start())
output.awaitTermination()

If you absolutely need custom write logic that the built-in JDBC writer does not support, then use the DataFrame foreachPartition method (on the batch DataFrame passed to foreachBatch) to write your rows in bulk rather than one at a time. With that approach, you can convert each Row object into a dict by calling its asDict() method.
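
For example, here is a minimal sketch of that approach (replacing the JDBC writeBatch above), assuming a hypothetical db_connect() helper that returns a DB-API-style connection (psycopg2, for instance) and a hypothetical events table with id and payload columns; each partition opens one connection and inserts all of its rows in a single executemany round trip:

def writePartition(rows):
    # rows is an iterator of Row objects for one partition
    conn = db_connect()  # assumed helper returning a DB-API connection
    records = [row.asDict() for row in rows]  # Row -> plain Python dict
    if records:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO events (id, payload) VALUES (%(id)s, %(payload)s)",
                records,
            )
        conn.commit()
    conn.close()

def writeBatch(batch_df, batch_id):
    # run the bulk writer once per partition of this micro-batch
    batch_df.foreachPartition(writePartition)

output = (spark_event_df
           .writeStream
           .foreachBatch(writeBatch)
           .start())
output.awaitTermination()

Opening one connection per partition (rather than per row) is the main win here; note also that asDict() takes recursive=True if your rows contain nested structs.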
