
I have one Spark Structured Streaming job that reads a stream from Kafka and writes output to HDFS. My issue is that I need an aggregated result for the entire day, up to a particular time. Since the file sink in Spark Structured Streaming only supports append mode (not complete/update), is there any way to achieve this?

For example, if data arrives at 10:00 AM, I need an aggregated result covering the current date up to 10:00 AM...

Can someone help with how to achieve this?

BigD

1 Answer


I'm not sure I get the exact specifics of the situation, but let me try to answer.

I would recommend a two-step process:

  1. The Spark streaming job saves mini-batches to a temporary folder laid out as:

     /yyyy-MM-dd/<offset from the day start>.parquet

     e.g. 2019-02-06/100000.parquet, 2019-02-06/200000.parquet (see the first sketch after this list).

  2. Another Spark job reads from the corresponding location, does the aggregation, and applies the time filtering (see the second sketch after this list).
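Here is a minimal PySpark sketch of step 1. The topic name `events`, the broker address, and the HDFS paths are my own illustrative assumptions, not from the question. One small deviation from the layout above: instead of encoding the offset in the file name, this keeps the Kafka record timestamp as a column and partitions by day, which turns the cut-off in step 2 into a plain WHERE clause.

```python
# Step 1 (sketch): stream from Kafka and append mini-batches to a
# date-partitioned folder on HDFS. Requires the spark-sql-kafka package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
       .option("subscribe", "events")                      # assumed topic
       .load())

# Derive a partition column from the Kafka record timestamp so each
# mini-batch lands under hdfs:///data/events/day=yyyy-MM-dd/...
events = (raw
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp"))
          .withColumn("day", F.date_format("timestamp", "yyyy-MM-dd")))

query = (events.writeStream
         .format("parquet")  # the file sink only supports append mode
         .option("path", "hdfs:///data/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .partitionBy("day")
         .start())

query.awaitTermination()
```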
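And a sketch of step 2: a separate batch job that reads the day's partition and aggregates everything up to the cut-off. The paths match the step 1 sketch, and the plain count stands in for whatever aggregate you actually need.

```python
# Step 2 (sketch): batch-aggregate the current day up to a cut-off time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-aggregate").getOrCreate()

day = "2019-02-06"
cutoff = f"{day} 10:00:00"  # aggregate everything up to 10:00 AM

df = spark.read.parquet(f"hdfs:///data/events/day={day}")

# Keep only records up to the requested time, then aggregate the
# day-so-far in one batch query.
agg = (df
       .where(F.col("timestamp") <= F.lit(cutoff).cast("timestamp"))
       .agg(F.count("*").alias("events_so_far")))

agg.write.mode("overwrite").parquet(f"hdfs:///results/{day}")
```

Re-running this job with a later cut-off gives you the updated day-so-far aggregate, which is effectively the complete/update behaviour the file sink doesn't give you.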

You can use a library like Luigi to manage these jobs (sketched below).
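For orchestration, one Luigi task per (day, cut-off) pair could launch the batch job via spark-submit. The script name and marker-file location are assumptions for illustration:

```python
# Orchestration sketch with Luigi: rerunnable task that triggers the
# step 2 aggregation and records completion with a marker file.
import subprocess
import luigi

class AggregateUpTo(luigi.Task):
    day = luigi.Parameter()     # e.g. "2019-02-06"
    cutoff = luigi.Parameter()  # e.g. "10:00:00"

    def output(self):
        # Marker file so Luigi knows this aggregate was already built.
        return luigi.LocalTarget(f"/tmp/markers/{self.day}_{self.cutoff}.done")

    def run(self):
        # daily_aggregate.py is a hypothetical script holding the step 2 job.
        subprocess.run(
            ["spark-submit", "daily_aggregate.py", self.day, self.cutoff],
            check=True)
        with self.output().open("w") as marker:
            marker.write("ok")

if __name__ == "__main__":
    luigi.build([AggregateUpTo(day="2019-02-06", cutoff="10:00:00")],
                local_scheduler=True)
```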

Vlad Vlaskin