
I am trying to implement logic in Redshift Spectrum where my original table looks like this:

Records in the student table:

1 || student1 || Boston   || 2019-01-01  
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago  || 2019-03-01 
1 || student1 || Dallas   || 2019-03-01

Records in the incremental table studentinc look like this:

1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

By joining the student and studentinc tables, I am trying to get the latest set of records, which should look like this:

2 || student2 || New York  || 2019-02-01
3 || student3 || Chicago   || 2019-03-01
1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

I have a solution that does a UNION of student and studentinc, then queries the result of the union based on max(modified_ts). However, this solution isn't good for huge tables. Is there a better solution that works by joining both tables?
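For reference, the UNION + max(modified_ts) approach described above can be sketched and sanity-checked locally. This is a SQLite illustration, not Redshift; the column names (id, name, city, modified_ts) are assumed from the sample data:

```python
import sqlite3

# Illustration of the UNION + max(modified_ts) approach, using SQLite.
# Column names (id, name, city, modified_ts) are assumed from the sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student    (id INT, name TEXT, city TEXT, modified_ts TEXT);
    CREATE TABLE studentinc (id INT, name TEXT, city TEXT, modified_ts TEXT);
    INSERT INTO student VALUES
        (1, 'student1', 'Boston',   '2019-01-01'),
        (2, 'student2', 'New York', '2019-02-01'),
        (3, 'student3', 'Chicago',  '2019-03-01'),
        (1, 'student1', 'Dallas',   '2019-03-01');
    INSERT INTO studentinc VALUES
        (1, 'student1', 'SFO',     '2019-04-01'),
        (4, 'student4', 'Detroit', '2019-04-01');
""")

# UNION both tables, then keep only each id's row with the max modified_ts.
rows = conn.execute("""
    SELECT u.id, u.name, u.city, u.modified_ts
    FROM (SELECT * FROM student UNION SELECT * FROM studentinc) u
    JOIN (SELECT id, MAX(modified_ts) AS modified_ts
          FROM (SELECT * FROM student UNION SELECT * FROM studentinc) x
          GROUP BY id) m
      ON u.id = m.id AND u.modified_ts = m.modified_ts
    ORDER BY u.id
""").fetchall()
for r in rows:
    print(r)
```

This reproduces the expected four latest records from the sample data.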

hadooper
  'isn't good for huge tables' - Have you proved this, or are you surmising? And what are you calling huge? – P.Salmon Oct 02 '19 at 06:00

2 Answers


1. Using Spark SQL, you can achieve this with not in and union:

scala> var df1 = Seq((1 ,"student1","Boston  " , "2019-01-01"  ),(2 ,"student2","New York" , "2019-02-01"),(3 ,"student3","Chicago " , "2019-03-01" ),(1 ,"student1","Dallas  " , "2019-03-01")).toDF("id","name","country","_date")

Register it as a temp table:

scala> df1.registerTempTable("temp1")
scala> sql("select * from temp1").show
+---+--------+--------+----------+
| id|    name| country|     _date|
+---+--------+--------+----------+
|  1|student1|Boston  |2019-01-01|
|  2|student2|New York|2019-02-01|
|  3|student3|Chicago |2019-03-01|
|  1|student1|Dallas  |2019-03-01|
+---+--------+--------+----------+

The second DataFrame:

scala> var df3 = Seq((1 , "student1", "SFO", "2019-04-01"),(4 , "student4", "Detroit", "2019-04-01")).toDF("id","name","country","_date")
scala> df3.registerTempTable("temp2")

scala> df3.show
+---+--------+-------+----------+
| id|    name|country|     _date|
+---+--------+-------+----------+
|  1|student1|    SFO|2019-04-01|
|  4|student4|Detroit|2019-04-01|
+---+--------+-------+----------+

Performing not in with a union:

scala> sql("select * from (select * from temp1 where id not in (select id from temp2)) tt").union(df3).show
+---+--------+--------+----------+
| id|    name| country|     _date|
+---+--------+--------+----------+
|  2|student2|New York|2019-02-01|
|  3|student3|Chicago |2019-03-01|
|  1|student1|     SFO|2019-04-01|
|  4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+

2. Using the Spark DataFrame API (a left_anti join): this is faster than the IN query because IN performs a row-wise operation.

scala> df1.join(df3, Seq("id"), "left_anti").union(df3).show
+---+--------+--------+----------+
| id|    name| country|     _date|
+---+--------+--------+----------+
|  2|student2|New York|2019-02-01|
|  3|student3|Chicago |2019-03-01|
|  1|student1|     SFO|2019-04-01|
|  4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+

Hope this helps. Let me know if you have any questions about it.

Mahesh Gupta

I would recommend window functions:

select s.*
from (select s.*,
             row_number() over (partition by studentid order by modified_ts desc) as seqnum
      from ((select s.* from student s
            ) union all
            (select i.* from studentinc i
            )
           ) s
     ) s
where seqnum = 1;

Note: The union all requires that the columns be exactly the same and in the same order. You may need to list out the columns if they are not the same.
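The window-function pattern above can be sanity-checked locally, for example with SQLite (3.25+ for window functions), which ships with Python. This is an illustration only, not Redshift SQL; column names (id, name, city, modified_ts) are assumed from the sample data in the question:

```python
import sqlite3

# Sanity check of the row_number() dedup pattern, using SQLite.
# Column names are assumed from the question's sample data;
# modified_ts plays the role of the ordering column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student    (id INT, name TEXT, city TEXT, modified_ts TEXT);
    CREATE TABLE studentinc (id INT, name TEXT, city TEXT, modified_ts TEXT);
    INSERT INTO student VALUES
        (1, 'student1', 'Boston',   '2019-01-01'),
        (2, 'student2', 'New York', '2019-02-01'),
        (3, 'student3', 'Chicago',  '2019-03-01'),
        (1, 'student1', 'Dallas',   '2019-03-01');
    INSERT INTO studentinc VALUES
        (1, 'student1', 'SFO',     '2019-04-01'),
        (4, 'student4', 'Detroit', '2019-04-01');
""")

# union all both tables, rank each id's rows by modified_ts descending,
# and keep only the newest row per id.
latest = conn.execute("""
    SELECT id, name, city, modified_ts
    FROM (SELECT s.*,
                 ROW_NUMBER() OVER (PARTITION BY id
                                    ORDER BY modified_ts DESC) AS seqnum
          FROM (SELECT * FROM student
                UNION ALL
                SELECT * FROM studentinc) s) t
    WHERE seqnum = 1
    ORDER BY id
""").fetchall()
for row in latest:
    print(row)
```

On the sample data this keeps exactly one row per student id, with the superseded Boston and Dallas rows for student1 dropped.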

Gordon Linoff