0

I'm doing an analysis using SQL and R, I want to join two tables like below listed:

Table 1:

ID  date
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310

Table 2

ID  date
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310

The situation is like: Customer got called (table1) and Customer called back for more info (table two)

So what I want to do in the analysis is to:

  1. Only show IDs that in both table.
  2. Match table 2 date to table 1 date by:
    • Match the closest date
    • table 2 date must >= table 1 date (like the example in result "a66" 20150310 is assigned to table1 date 20150310, while 20150308 is assigned to 20150308, not 20150309)

Result:

ID  table1 date table2 date
a11 20150302    
a11 20150302    20150303
a22 20150303    20150304
a22 20150304    20150305
a44 20150306    20150306
a66 20150308    20150308
a66 20150309    
a66 20150310    20150310

Is there any solutions for this many to many (but I don't want the n*m as result, I want 1 to 1) matching/join? Solution either in R or SQL will be wanted.

Thanks

Juan Carlos Oropeza
  • 47,252
  • 12
  • 78
  • 118
WayToNinja
  • 285
  • 4
  • 14
  • If table 2 date must >= table 1 date, why are there blank spaces? in table2 date? – Pierre Lapointe Nov 06 '15 at 20:34
  • why `a11` has two rows? your 2.1 rule say match with closest date. `null` isnt closest date. – Juan Carlos Oropeza Nov 06 '15 at 20:54
  • 1
    Can you explain your rules a litle better. Dont know why `a11 20150302` have null and `20150303` at same time. Also why `a66 20150309 null` and not `a66 20150309 20150310` – Juan Carlos Oropeza Nov 06 '15 at 21:12
  • 1. Blank space because for certain ID, table 1 have 3 records but table 2 have only 2 records... Just like called customer 3 times but customer only called back 2 times, so blank means that time customer didn't call back – WayToNinja Nov 06 '15 at 21:40
  • The rule is more like, the match is from table 2 side, the table 2 will match its date to the closest date in table 1. And for a11, since table 2 only have one value so even if a11 have two 20150302, but I want to only match to one of them so it will show that customer only call backed once... not twice – WayToNinja Nov 06 '15 at 21:44
  • what is your rdbms? btw if you include user name with `@KarlTian` that person get a notification. Otherwise you have to wait they check the question again. The one making the question always get a notification – Juan Carlos Oropeza Nov 06 '15 at 21:50
  • So why `a66 20150309` get null? how you know table2 `20150310` isnt a reply to that one? and the one without reply is table1 `20150310` – Juan Carlos Oropeza Nov 06 '15 at 22:16
  • @Juan Carlos Oropeza, Netezza. and yeah....you made a good point... I want to show some null value because I don't want to confuse that how many times people call back – WayToNinja Nov 06 '15 at 22:22
  • So you dont care which one I choose? – Juan Carlos Oropeza Nov 06 '15 at 22:23

3 Answers3

1
SELECT ID, Date1, Date2 FROM (
SELECT joined.ID,  joined.Date1, joined.Date2, ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC) AS RowNumber 
FROM(
SELECT t1.ID, t1.[Date] as Date1, CASE WHEN t2.[Date] >= t1.[Date] THEN t2.[Date] ELSE NULL END as [Date2] 
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.ID = t2.ID) as joined 
WHERE joined.Date2 IS NOT NULL
) partitioned
WHERE RowNumber = 1

Joins the two tables on ID and removes the rows in Table 2 that is not in Table 1. Then uses the ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC) to match the closest date which is found by the WHERE RowNumber = 1 clause.

Produces this output that is consistent with the conditions you listed:

+-----+----------+----------+
| ID  |  Date1   |  Date2   |
+-----+----------+----------+
| a11 | 20150302 | 20150303 |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150304 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 | 20150308 |
| a66 | 20150309 | 20150310 |
| a66 | 20150310 | 20150310 |
+-----+----------+----------+
Mark He
  • 735
  • 7
  • 14
  • Thanks for answering, but isn't it will make the output: 1st row of a66 become 20150308 paired with both 20150308 and 20150310? It should be only 20150308 on the table2 date column – WayToNinja Nov 06 '15 at 21:00
  • @KarlTian edited my answer, it matches your conditions. however, the result set in your question is not consistent with these conditions. – Mark He Nov 06 '15 at 21:18
  • Looks really nice, Thank you @Markmanguy. Just one more question, why in your output, column date2, the a22's result is not **20150304 and 20150305**, but **two 20150304** – WayToNinja Nov 06 '15 at 22:04
  • @KarlTian because the closest date to `20150304` is `20150304`, not `20150305` – Mark He Nov 06 '15 at 23:25
1

I get the same result as markmanguy in R with dplyr. For a22, the closest callback for the 20150304 initial call is 20150304, not 20150305. You need a time component to distinguish this.

library(dplyr)
inner_join(table1,table2,"ID")%>%
group_by(ID,date1)%>%
filter(date1<=date2)%>%
filter(row_number() == 1)

>
Source: local data frame [7 x 3]
Groups: ID, date1 [7]

     ID    date1    date2
  (chr)    (int)    (int)
1   a11 20150302 20150303
2   a22 20150303 20150304
3   a22 20150304 20150304
4   a44 20150306 20150306
5   a66 20150308 20150308
6   a66 20150309 20150310
7   a66 20150310 20150310

Data

table1 <-read.table(text="ID  date1
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310", header=T,stringsAsFactors =F)
table2 <-read.table(text="ID  date2
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310", header=T,stringsAsFactors =F)
Pierre Lapointe
  • 16,017
  • 2
  • 43
  • 56
1

This doesnt solve it but is close and maybe will give you an idea

SqlFiddleDemo

With t_left as (
    SELECT *, row_number() over (partition by "ID" order by date desc ) as rn
    FROM Table1 T
    WHERE EXISTS (SELECT 1 FROM Table2 P WHERE T."ID" = P."ID")
),
t_right as (
    SELECT *, row_number() over (partition by "ID" order by date desc) as rn
    FROM Table2
) 
SELECT t_left."ID", t_left."date", t_right."date"
FROM t_left
LEFT JOIN t_right
       on t_left.rn = t_right.rn
      and t_left."ID" = t_right."ID"
ORDER BY t_left."ID", t_left."date"

OUTPUT

|  ID |     date |     date |
|-----|----------|----------|
| a11 | 20150302 | 20150303 |
| a11 | 20150302 |   (null) |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150305 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 |   (null) |
| a66 | 20150309 | 20150308 |
| a66 | 20150310 | 20150310 |
Juan Carlos Oropeza
  • 47,252
  • 12
  • 78
  • 118