
I have one Spark Structured Streaming job that reads a stream from Kafka and writes output to HDFS. My issue is that I need an aggregated result for the entire day, up to a particular time. Since the file sink in Spark Structured Streaming only supports append mode (not complete/update), is there any way to achieve this?

For example, if data arrives at 10:00 AM, I need an aggregated result covering the current date up to 10:00 AM...

Can someone help with how to achieve this?

BigD

1 Answer


I'm not sure I get the exact specifics of the situation, but let me try to answer.

I would recommend a two-step process:

  1. The Spark streaming job saves mini-batches to a temporary folder laid out as:

     /yyyy-MM-dd/<offset from the day start>.parquet

     e.g. 2019-02-06/100000.parquet, 2019-02-06/200000.parquet (see the first sketch after this list).

  2. Another Spark job reads from the corresponding location, does the aggregation, and applies the time filtering (see the second sketch after this list).
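Here is a minimal PySpark sketch of step 1. The topic name `events`, the broker address, and the HDFS paths are my own illustrative assumptions, not from the question. One small deviation from the layout above: instead of encoding the offset in the file name, this keeps the Kafka record timestamp as a column and partitions by day, which turns the cut-off in step 2 into a plain WHERE clause.

```python
# Step 1 (sketch): stream from Kafka and append mini-batches to a
# date-partitioned folder on HDFS. Requires the spark-sql-kafka package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
       .option("subscribe", "events")                      # assumed topic
       .load())

# Derive a partition column from the Kafka record timestamp so each
# mini-batch lands under hdfs:///data/events/day=yyyy-MM-dd/...
events = (raw
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp"))
          .withColumn("day", F.date_format("timestamp", "yyyy-MM-dd")))

query = (events.writeStream
         .format("parquet")  # the file sink only supports append mode
         .option("path", "hdfs:///data/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .partitionBy("day")
         .start())

query.awaitTermination()
```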
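And a sketch of step 2: a separate batch job that reads the day's partition and aggregates everything up to the cut-off. The paths match the step 1 sketch, and the plain count stands in for whatever aggregate you actually need.

```python
# Step 2 (sketch): batch-aggregate the current day up to a cut-off time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-aggregate").getOrCreate()

day = "2019-02-06"
cutoff = f"{day} 10:00:00"  # aggregate everything up to 10:00 AM

df = spark.read.parquet(f"hdfs:///data/events/day={day}")

# Keep only records up to the requested time, then aggregate the
# day-so-far in one batch query.
agg = (df
       .where(F.col("timestamp") <= F.lit(cutoff).cast("timestamp"))
       .agg(F.count("*").alias("events_so_far")))

agg.write.mode("overwrite").parquet(f"hdfs:///results/{day}")
```

Re-running this job with a later cut-off gives you the updated day-so-far aggregate, which is effectively the complete/update behaviour the file sink doesn't give you.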

You can use a library like Luigi to manage these jobs (sketched below).
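For orchestration, one Luigi task per (day, cut-off) pair could launch the batch job via spark-submit. The script name and marker-file location are assumptions for illustration:

```python
# Orchestration sketch with Luigi: rerunnable task that triggers the
# step 2 aggregation and records completion with a marker file.
import subprocess
import luigi

class AggregateUpTo(luigi.Task):
    day = luigi.Parameter()     # e.g. "2019-02-06"
    cutoff = luigi.Parameter()  # e.g. "10:00:00"

    def output(self):
        # Marker file so Luigi knows this aggregate was already built.
        return luigi.LocalTarget(f"/tmp/markers/{self.day}_{self.cutoff}.done")

    def run(self):
        # daily_aggregate.py is a hypothetical script holding the step 2 job.
        subprocess.run(
            ["spark-submit", "daily_aggregate.py", self.day, self.cutoff],
            check=True)
        with self.output().open("w") as marker:
            marker.write("ok")

if __name__ == "__main__":
    luigi.build([AggregateUpTo(day="2019-02-06", cutoff="10:00:00")],
                local_scheduler=True)
```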

Vlad Vlaskin