0

I have two tables which I want to join with a specific logic.

Table_1 ( S_No, ID, Date1, Date2 )

S_No    ID  Date1   Date2
1   id1 2014-05-01  2014-07-03
2   id1 2015-03-23  2016-06-18
3   id1 2016-06-21  2016-07-29

Table_2 ( S_No_New, ID_New, Date_New )

S_No_New    ID_New  Date_New
2_1 id1 2014-04-25
2_2 id1 2014-06-14
2_3 id1 2015-01-10
2_4 id1 2015-02-15
2_5 id1 2015-05-17
2_6 id1 2016-04-24
2_7 id1 2016-06-19
2_8 id1 2016-06-25
2_9 id1 2016-07-11
2_10    id1 2016-08-11
2_11    id1 2016-08-16

I want to join above two table in such a way that I get a count of how many rows are there in table_2 before Date1 and between Date1 and Date2 and then when we move to the next row we have to use the data which is not counted yet so far for the same id.

And if we have a date entry in table_2 after the last Date2 entry in table 1 then we need to append a new row with '+1" added to S_No and similar remaining column details.

Overall this problem can be split into two parts :
1) Getting the counts column
2) Adding up the extra rows ( S_No "4" in this example )

Please drop an answer if you know solution to either of the two.

Final output :

S_No    ID  Date1   Date2   Count_pre   Count_Between
1   id1 2014-05-01  2014-07-03  1   1
2   id1 2015-03-23  2016-06-18  2   2
3   id1 2016-06-21  2016-07-29  1   2
4   id1 NULL    NULL    2   0

Logic :
S_No 1 :
Count_Pre = Dates before 2014-05-01
Count_between = Dates between 2014-05-01 and 2014-07-03

S_No 2 :
Count_Pre = Dates between 2014-07-03 and 2015-03-23
Count_between = Dates between 2015-03-23 and 2016-06-18
and so on

Intermediate table has to look something like this:

S_No    ID  Date    Date2   S_No_New    Date_New
1   id1 2014-05-01  2014-07-03  2_1 2014-04-25
1   id1 2014-05-01  2014-07-03  2_2 2014-06-14
2   id1 2015-03-23  2016-06-18  2_3 2015-01-10
2   id1 2015-03-23  2016-06-18  2_4 2015-02-15
2   id1 2015-03-23  2016-06-18  2_5 2015-05-17
2   id1 2015-03-23  2016-06-18  2_6 2016-04-24
3   id1 2016-06-21  2016-07-29  2_7 2016-06-19
3   id1 2016-06-21  2016-07-29  2_8 2016-06-25
3   id1 2016-06-21  2016-07-29  2_9 2016-07-11
4   id1 NULL    NULL    2_10    2016-08-11
4   id1 NULL    NULL    2_11    2016-08-16

I was trying out different windowing and analytics function but couldn't get through this problem. Is it possible to do this kind of join in hive ( basic sql ) ?

NOTE : EDIT 2 : I need to implement this in hive and it supports all the builtin functions but not the variables from mysql. It supports aggregate, windowing and analytics funtions.

EDIT : Changed the date format from dd/mm/yyyy to yyyy-mm-dd

Arpit Gupta
  • 501
  • 1
  • 5
  • 14

1 Answers1

0
    SELECT t.t1s_no,t.date1,t.date2,
         sum(case when t.srce = 'P' then 1 else 0 end) as 'prev',
         sum(case when t.srce = 'B' then 1 else 0 end) as 'between',
         sum(case when t.srce = 'X' then 1 else 0 end) as 'missing'
FROM 
(
SELECT S.*,
        ROW_NUMBER() OVER (PARTITION BY S.DATE_NEW ORDER BY s.srce ,S.DATE1) RN
FROM
(SELECT 'P' AS SRCE,T1.S_NO T1S_NO,T1.ID T1ID,T1.DATE1 DATE1,T1.DATE2 DATE2,T2.DATE_NEW
FROM    TABLE_1 T1
JOIN    TABLE_2 T2 ON T2.DATE_NEW < T1.DATE1
UNION 
SELECT 'B' AS SRCE,T1.S_NO T1S_NO,T1.ID T1ID,T1.DATE1 DATE1,T1.DATE2 DATE2,T2.DATE_NEW
FROM    TABLE_1 T1
JOIN    TABLE_2 T2 ON T2.DATE_NEW BETWEEN T1.DATE1 AND T1.DATE2
UNION 
SELECT 'X' AS SRCE,4 T1S_NO,T1.ID T1ID,T1.DATE1 DATE1,T1.DATE2 DATE2,T2.DATE_NEW
FROM    TABLE_2 T2
left JOIN   TABLE_1 T1 ON (T2.DATE_NEW BETWEEN T1.DATE1 AND T1.DATE2) or (t2.date_new < t1.date1) 
where   t1.date1 is null
) S 
) T
WHERE T.RN = 1 
group   by t.t1s_no,t.date1,t.date2
ORDER BY T.T1S_NO, T.DATE1
;

Result

t1s_no      date1            date2            prev        between     missing
----------- ---------------- ---------------- ----------- ----------- -----------
          1       2014-05-01       2014-07-03           1           1           0
          2       2015-03-23       2016-06-18           2           2           0
          3       2016-06-21       2016-07-29           1           2           0
          4             NULL             NULL           0           0           2

(4 rows affected)
P.Salmon
  • 17,104
  • 2
  • 12
  • 19
  • For S_No = 2 it should actually be 2 each. Have updated in question. I wanted to use it in hive and it doesn't support variables, is it possible to do this without the use of variables. Please read the edit for more updates – Arpit Gupta Oct 20 '16 at 04:38
  • Sorry my bad about the update. Should have put it earlier regarding the hive implementation. – Arpit Gupta Oct 20 '16 at 04:45
  • I don't know hive but it seems to have a row_number() function which would possibly be equivalent to the variable method I have used. If the hive row_number() function looks like this row_number() over (partition by ... order by ...) then I could update my answer using t-sql. Let me know if you wish me to do so. – P.Salmon Oct 20 '16 at 07:53
  • Yes hive support row_number() function like row_number() over (partition by ... order by ...). It would be really helpful if you can modify the above query. Thanks in advance. – Arpit Gupta Oct 20 '16 at 08:56