Order by and Join in SQL or spark or mapreduce

Question

I have two tables whose content is as below.

Table 1:
ID1     ID2     ID3    ID4                NAME    DESCR   STATUS        date   
1    -12134    17773    8001300701101    name1    descr1    INACTIVE    20121203
2    -12136    17773    8001300701101    name1    descr1    INACTIVE     20121202
3    -12138    17785    9100000161822    name3    descr3    INACTIVE    20121201
4    -12140    17785    9100000161822    name3   descr3    ACTIVE        20121130
5    -12142    17787    8000500039106    name4    descr4    ACTIVE        20121129

Table2:
ID1    ID2         ID3        ID4       NAME    DESCR  
0    17781    17773    8001300701101    name1    descr1
0    17783    17783    8001300060109    name2    descr2
0    17785    17785    9100000161822    name3    descr3
0    17787    17787    8000500039106    name4    descr4
0    17789    17789    0000080052364    name5    descr5

I am trying to get below result.

ID3        ID4            NAME     DESCR      STATUS        date        
17773    8001300701101    name1    descr1      INACTIVE    20121202
17783    8001300060109    name2    descr2      NULL        NULL
17785    9100000161822    name3    descr3      ACTIVE      20121201
17787    8000500039106    name4    descr4      ACTIVE      20121129
17789    0000080052364    name5    descr5      NULL        NULL

As per the above i/p and o/p, the two tables should be joined based on columns id3, id4, name and desc. if an active record exists, it should return the active record. but if inactive record only exists, then oldest inactive record should be joined.

I tried different queries which are no longer near to the answer i wanted. The four columns joined are all non primary fields but not nulls. There can be one to many or many to many relationship between the two tables.

I am working on Apache phoenix and if the solution is in Hadoop Mapreduce or in Apache Spark also OK.

A sample query i have written is as follows.

Select table2.*, table1.status, table1.date 
From table1 Right outer join table2 on table1.id3 = table2.id3 
            and  table1.id4 = table2.id4 
            and table1.name = table2.name 
            and table1.descr = table2.descr 
Order by (status) and order by (date)

Can any one help me please?

I believe your query will return more than one row if there are more than one row in table 1 that match. — David Griffin, May 30 '15 at 20:29
@David Griffin yes you are right. But I expect that only one active will exists if more than one I should pick up the oldest active element. If active element is not present at all and multiple inactive element exists I should pick oldest inactive — Satya, May 30 '15 at 22:38

David Griffin · Accepted Answer · 2015-05-31T11:35:16.547

You cannot do a straight join against Table 1. Instead, you have to join against multiple queries of Table 1, which are themselves joined together. By my count, you are going to have to do:

A query to find the minimum date for ACTIVE records in table 1 per ID3, ID4, etc.
A query to find the minimum date for INACTIVE records in table 1
A full outer join between the above two queries
coalesce to select ACTIVE versus INACTIVE fields.

Something like this:

val cookedTable1 = table1.filter(
  $"STATUS" === "ACTIVE"
).groupBy(
  $"ID3", $"ID4", $"NAME", $"DESCR", $"STATUS"
).agg(
  $"ID3", $"ID4", $"NAME", $"DESCR", $"STATUS", min($"date") as "date"
).join(
  table1.filter(
    $"STATUS" === "INACTIVE"
  ).groupBy(
    $"ID3", $"ID4", $"NAME", $"DESCR", $"STATUS"
  ).agg(
    $"ID3", $"ID4", $"NAME", $"DESCR", $"STATUS", min($"date") as "date"
  ).select(
    $"ID3" as "ID3r", $"ID4" as "ID4r", $"NAME" as "NAMEr", $"DESCR" as "DESCRr",
    $"STATUS" as "STATUSr", $"date" as "dater"
  ),
  $"ID3" === $"ID3r" and $"ID4" === $"ID4r" and $"NAME" === $"NAMEr" and $"DESCR" === $"DESCRr", 
  "full_outer"
)
.select(
  coalesce($"ID3", $"ID3r") as "ID3", 
  coalesce($"ID4",$"ID4r") as "ID4", 
  coalesce($"NAME", $"NAMEr") as "NAME",
  coalesce($"DESCR", $"DESCRr") as "DESCR",
  coalesce($"STATUS", $"STATUSr") as "STATUS",
  coalesce($"date", $"dater") as "date"
)

Given what you have in your above table 1, the result would look like:

cookedTable1.show

ID3   ID4           NAME  DESCR  STATUS   date
17785 9100000161822 name3 descr3 ACTIVE   20121130
17787 8000500039106 name4 descr4 ACTIVE   20121129
17773 8001300701101 name1 descr1 INACTIVE 20121202

Now, using cookedTable1 in place of table1, do the same query you did before:

cookedTable1.registerTempTable("cookedTable1")
val results = sqlContext("Select table2.*, cookedTable1.status, cookedTable1.date
  From cookedTable1 Right outer join table2 on cookedTable1.id3 = table2.id3
    and cookedTable1.id4 = table2.id4
    and cookedTable1.name = table2.name
    andcookedTable1.descr = table2.descr"
)

This should get you the results you were originally looking for.

thanks a lot. Now I have better picture. May I ask, that in the expected result, the records with no match are expected as well. is it not possible? I am yet to try your solution, so just a clarification please. — Satya, May 31 '15 at 06:59
Updated my answer. The "right outer join" part is what causes the records with no match to return `null` values instead of no record. — David Griffin, May 31 '15 at 11:37

score 0 · Answer 2 · answered May 30 '15 at 19:12

0

I can only speak for Spark. The query appears correct in terms of the Right Outer Join and the four join columns.

In Spark (and AFAIK in ANSI sql) the order by is not the way you show but instead:

order by status, date

answered May 30 '15 at 19:12

WestCoastProjects

58,982
91
316
560

Order by and Join in SQL or spark or mapreduce

2 Answers2