
I need to crack a Spark query. Let's say I have my data in a Delta table (tab) like:

cust  date    acct_id        f1     f2  source_date(dd/mm/yy:h)
b1   1/10/22   acc1          x      y   9/9/22:1 P.M
b1   1/10/22   acc2          x      y   9/9/22:1 P.M
b1   1/10/22   acc3          x      y   9/9/22:1 P.M
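
For reference, a minimal sketch of how such a Delta table could be created (the SQL below, the TIMESTAMP type for source_date, and the 24-hour timestamps are illustrative assumptions; only the table name tab and the sample rows come from above):

    import org.apache.spark.sql.SparkSession;

    public class CreateSnapshotTable {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("create-snapshot-table")
                    // settings needed to enable Delta Lake on the session
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                    .config("spark.sql.catalog.spark_catalog",
                            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                    .getOrCreate();

            // source_date stored as TIMESTAMP so it can be filtered and ordered reliably
            spark.sql("CREATE TABLE IF NOT EXISTS tab ("
                    + "cust STRING, date STRING, acct_id STRING, "
                    + "f1 STRING, f2 STRING, source_date TIMESTAMP) USING DELTA");

            spark.sql("INSERT INTO tab VALUES "
                    + "('b1', '1/10/22', 'acc1', 'x', 'y', timestamp '2022-09-09 13:00:00'), "
                    + "('b1', '1/10/22', 'acc2', 'x', 'y', timestamp '2022-09-09 13:00:00'), "
                    + "('b1', '1/10/22', 'acc3', 'x', 'y', timestamp '2022-09-09 13:00:00')");
        }
    }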

Now I am sourcing changes from a DB and just appending them to my Delta table. For example, at any given instant these are the changes I get from the DB (via Kafka or any other message broker) using a scheduler that runs every 2 minutes:

Changes:

b1   1/10/22   acc2          p      y   10/9/22:2 P.M
b1   1/10/22   acc2          p      q   11/9/22:4 P.M
b1   1/10/22   acc3          m      n   11/9/22:4 P.M

Then I run a Spark job that simply appends the changes to my Delta table, after which the table looks like:

cust  date    acct_id        f1     f2  source_date(dd/mm/yy:h)
b1   1/10/22   acc1          x      y   9/9/22:1 P.M
b1   1/10/22   acc2          x      y   9/9/22:1 P.M
b1   1/10/22   acc3          x      y   9/9/22:1 P.M
b1   1/10/22   acc2          p      y   10/9/22:2 P.M
b1   1/10/22   acc2          p      q   11/9/22:4 P.M
b1   1/10/22   acc3          m      n   11/9/22:4 P.M
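
A minimal sketch of that append-only job, assuming the consumed changes have already been staged in a hypothetical changes_batch table (parsing the broker messages is out of scope here):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AppendChanges {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("append-changes")
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                    .config("spark.sql.catalog.spark_catalog",
                            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                    .getOrCreate();

            // changes_batch is a hypothetical table/view holding the rows consumed
            // from the message broker during the current scheduler run
            Dataset<Row> changes = spark.table("changes_batch");

            // Append only: existing versions of an account are never updated or deleted
            changes.write()
                   .format("delta")
                   .mode("append")
                   .saveAsTable("tab");
        }
    }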

Now I want to serve the queries below using Spark:

Query: get data for cust=b1, date=1/10/22, source_date=9/9/22:1 P.M

Output:
b1   1/10/22   acc1          x      y   9/9/22:1 P.M
b1   1/10/22   acc2          x      y   9/9/22:1 P.M
b1   1/10/22   acc3          x      y   9/9/22:1 P.M

Query: get data for cust=b1, date=1/10/22, source_date=10/9/22:2 P.M

Output:
b1   1/10/22   acc1          x      y   9/9/22:1 P.M
b1   1/10/22   acc2          p      y   10/9/22:2 P.M
b1   1/10/22   acc3          x      y   9/9/22:1 P.M

Query: get data for cust=b1, date=1/10/22, source_date=11/9/22:4 P.M

Output:
b1   1/10/22   acc1          x      y   9/9/22:1 P.M
b1   1/10/22   acc2          p      q   11/9/22:4 P.M
b1   1/10/22   acc3          m      n   11/9/22:4 P.M
1 Answer


This has been solved using the code below:

    import static org.apache.spark.sql.functions.*;

    // Keep rows at or before the requested source_date, then, per account,
    // take the max source_date together with the last observed f1/f2 values
    Dataset<Row> result = data.filter(col("source_date").leq("2022-09-10 02:00:00.0"))
            .groupBy("cust", "date", "acct_id")
            .agg(max("source_date").as("source_date"), last("f1").as("f1"), last("f2").as("f2"));
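
For the example queries above, the result would presumably also need to be restricted to the requested cust and date, for example (values taken from the first query):

    import static org.apache.spark.sql.functions.col;

    // Continues from 'result' above: narrow to the customer/date asked for in the query
    result.filter(col("cust").equalTo("b1"))
          .filter(col("date").equalTo("1/10/22"))
          .show();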