I am doing some transformations on a Spark Structured Streaming dataframe and storing the transformed dataframe as parquet files in HDFS. I want the write to HDFS to happen in batches, instead of transforming the whole dataframe first and then storing it.
-
Can you clarify? ... Are your persisted parquet files in HDFS output from the structured streaming job or a regular Spark job? Or are you trying to use structured streaming to write in mini-batches to parquet in HDFS? – thePurplePython Apr 26 '19 at 04:09
-
I am trying to write the parquet files in mini-batches to HDFS from my structured streaming job. The source of my structured stream is Kafka. – Y0gesh Gupta Apr 26 '19 at 12:16
-
Thanks for clarifying. I have provided some solutions for you to get started. – thePurplePython Apr 26 '19 at 13:20
1 Answer
Here is a parquet sink example:
# parquet sink example
targetParquetHDFS = (sourceTopicKAFKA
    .writeStream
    .format("parquet")                                        # can be "orc", "json", "csv", etc.
    .outputMode("append")                                     # the file sink only supports "append"
    .option("path", "path/to/destination/dir")
    .partitionBy("col")                                       # if you need to partition
    .trigger(processingTime="...")                            # "mini-batch" frequency when data is output to the sink
    .option("checkpointLocation", "path/to/checkpoint/dir")   # write-ahead logs for recovery purposes
    .start())

targetParquetHDFS.awaitTermination()
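The snippet above assumes sourceTopicKAFKA is already a streaming DataFrame read from Kafka. A minimal sketch of how it might be built (the broker address, topic name, and selected columns are placeholder assumptions, not something from the question):

# hypothetical Kafka source for the sink above
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

sourceTopicKAFKA = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")   # placeholder broker
    .option("subscribe", "my_topic")                    # placeholder topic name
    .load()
    # Kafka delivers key/value as binary, so cast them before transforming
    .select(col("key").cast("string"),
            col("value").cast("string"),
            col("timestamp")))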
For more specific details:
Kafka Integration: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
SS Programming Guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
Added:
Ok ... I added some stuff to the response to clarify your question.
SS has a few different Trigger Types:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
default: the next trigger fires once the previous trigger has finished processing
fixed interval: .trigger(processingTime='10 seconds')
so a trigger of 10 seconds will fire at 00:10, 00:20, 00:30, and so on (see the sketch after this list)
one-time: processes all available data at once with .trigger(once=True)
continuous / fixed checkpoint interval => best to see the programming guide doc
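As a rough sketch of the first three trigger types applied to the same parquet sink (the 10-second interval and the paths are placeholder assumptions; only one trigger would be used per query):

# trigger variations on the same writeStream; uncomment one at a time
writer = (sourceTopicKAFKA
    .writeStream
    .format("parquet")
    .option("path", "path/to/destination/dir")
    .option("checkpointLocation", "path/to/checkpoint/dir"))

q = writer.trigger(processingTime="10 seconds").start()   # fixed interval micro-batches
# q = writer.trigger(once=True).start()                   # one-time: drain available data, then stop
# q = writer.start()                                      # default: next batch starts when the previous finishes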
Therefore, in your Kafka example, SS can process the data (tracked by its event-time timestamp) in micro-batches via the "default" or "fixed interval" triggers, or do a "one-time" processing of all the data available in the Kafka source topic.

-
Thanks for your answer. I am trying to understand how the trigger creates the micro-batch. Suppose I have 20,000 messages streamed from Kafka. Will the transformation on all the messages happen at once as a single micro-batch, or will these messages be converted into small micro-batches within the time interval and then processed one by one? – Y0gesh Gupta Apr 26 '19 at 14:48
-
@Y0geshGupta you're welcome ... I added more information to the response regarding your question – thePurplePython Apr 26 '19 at 15:06
-
Thanks for the details, that makes it a bit clearer. I am still not sure, though, whether using processingTime as the trigger will store the parquet files from the dataset in small batches (transform in small batches and then store), or whether it will do the transformation on the complete 20k records in the dataframe and store them as parquet, treating it as a single micro-batch. I am taking 20k as the number of records because that is the initial count of messages coming from my Kafka topic. – Y0gesh Gupta Apr 26 '19 at 15:32
-
If you do processingTime it appends new data (as parquet files) every "trigger" interval frequency based off the **event time in your source data's timestamp** ... I am not super experienced with Kafka architecture, but I assume your data is "streaming" and keeps track of each observation's event time as a timestamp. – thePurplePython Apr 26 '19 at 16:15
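One way to see for yourself how a trigger slices the 20k Kafka records into micro-batches is foreachBatch (available from Spark 2.4); the sketch below uses a hypothetical helper that logs each batch's size before writing it, with placeholder paths and interval:

# sketch for inspecting micro-batch sizes; helper name and paths are assumptions
def write_and_log(batch_df, batch_id):
    # called once per micro-batch produced by the trigger
    print("micro-batch %d contains %d records" % (batch_id, batch_df.count()))
    batch_df.write.mode("append").parquet("path/to/destination/dir")

query = (sourceTopicKAFKA
    .writeStream
    .foreachBatch(write_and_log)
    .option("checkpointLocation", "path/to/checkpoint/dir")
    .trigger(processingTime="10 seconds")
    .start())

query.awaitTermination()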