
I have to process some files that arrive daily. The information has a primary key (date, client_id, operation_id), so I created a stream that appends only new data into a Delta table:

operations\
        .repartition('date')\
        .writeStream\
        .outputMode('append')\
        .trigger(once=True)\
        .option("checkpointLocation", "/mnt/sandbox/operations/_chk")\
        .format('delta')\
        .partitionBy('date')\
        .start('/mnt/sandbox/operations')

This works fine, but I need to summarize this information grouped by (date, client_id), so I created another stream from this operations table to a new table:

summarized = spark.readStream.format('delta').load('/mnt/sandbox/operations')

summarized = summarized.groupBy('client_id', 'date').agg(<a lot of aggs>)

summarized.repartition('date')\
        .writeStream\
        .outputMode('complete')\
        .trigger(once=True)\
        .option("checkpointLocation", "/mnt/sandbox/summarized/_chk")\
        .format('delta')\
        .partitionBy('date')\
        .start('/mnt/sandbox/summarized')

This works, but every time new data arrives in the operations table, Spark recalculates summarized all over again. I tried to use append mode on the second stream, but it needs a watermark, and the date column is DateType.
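
For reference, Spark only accepts append output mode on a streaming aggregation if the query declares an event-time watermark, and withWatermark requires a timestamp column, so a sketch of that attempt would look roughly like this (the cast, the date_ts name, and the count are mine, standing in for <a lot of aggs>):

from pyspark.sql import functions as F

ops = spark.readStream.format('delta').load('/mnt/sandbox/operations')

# withWatermark rejects DateType, so the date has to be cast to a timestamp first
ops = ops.withColumn('date_ts', F.col('date').cast('timestamp'))

attempt = ops.withWatermark('date_ts', '1 day')\
        .groupBy('client_id', 'date_ts')\
        .agg(F.count('operation_id').alias('op_count'))  # stand-in for the real aggregations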

Is there a way to calculate only the new aggregates based on the group keys and append them to the summarized table?


1 Answer


You need to use Spark Structured Streaming window operations on event time.

When you use windowed operations, Spark buckets the data according to windowDuration and slideDuration: windowDuration is the length of each window, and slideDuration is how far the window slides each time.

If you group by using window() [docs], you will get a window column in the result along with the other columns you group by, such as client_id.

For example:

from pyspark.sql.functions import window

windowDuration = "10 minutes"
slideDuration = "5 minutes"

# before_summary is the streaming DataFrame read from the operations table
summarized = before_summary.groupBy(before_summary.client_id,
    window(before_summary.date, windowDuration, slideDuration)
).agg(<a lot of aggs>).orderBy('window')
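
Building on that, here is a rough sketch of how the windowed aggregate could be written back to the summarized table in append mode. It assumes date is first cast to a timestamp so a watermark can be defined on it (withWatermark needs an event-time timestamp column), uses a daily tumbling window, and uses a count as a placeholder for <a lot of aggs>:

from pyspark.sql import functions as F

before_summary = spark.readStream.format('delta').load('/mnt/sandbox/operations')\
        .withColumn('date_ts', F.col('date').cast('timestamp'))

summarized = before_summary\
        .withWatermark('date_ts', '1 day')\
        .groupBy(before_summary.client_id, F.window('date_ts', '1 day'))\
        .agg(F.count('operation_id').alias('op_count'))\
        .withColumn('date', F.col('window.start').cast('date'))  # recover a date column for partitioning

summarized.writeStream\
        .outputMode('append')\
        .trigger(once=True)\
        .option("checkpointLocation", "/mnt/sandbox/summarized/_chk")\
        .format('delta')\
        .partitionBy('date')\
        .start('/mnt/sandbox/summarized')

Note that with append output mode a window is only emitted once the watermark passes the end of that window, so the aggregates for the most recent day typically land in summarized on a later run rather than immediately.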