2

I have a data set like this (DDL below):

+----+------------------+----------------------+---------------------+
| ID | NAME             | EMAIL                | LAST_UPD            |
+----+------------------+----------------------+---------------------+
| 1  | JOHN SMITH       | JOHN.SMITH@GMAIL.COM | 29/04/2017 10:50:51 |
+----+------------------+----------------------+---------------------+
| 1  | J SMITH          | JOHN.SMITH@GMAIL.COM | 29/04/2017 10:51:15 |
+----+------------------+----------------------+---------------------+
| 1  | J SMITH          | JOHN.SMITH@GMAIL.COM | 29/04/2017 10:51:36 |
+----+------------------+----------------------+---------------------+
| 1  | JOHN JAMES SMITH | JOHN.SMITH@GMAIL.COM | 29/04/2017 10:52:11 |
+----+------------------+----------------------+---------------------+
| 2  | JAMES FORD       | JAMES.FORD@GMAIL.COM | 29/04/2017 10:52:57 |
+----+------------------+----------------------+---------------------+
| 2  | JAMES FORD       | J.FORD@GMAIL.COM     | 29/04/2017 10:53:17 |
+----+------------------+----------------------+---------------------+
| 2  | JAMES FORD       | J.FORD@GMAIL.COM     | 29/04/2017 11:47:15 |
+----+------------------+----------------------+---------------------+

I'm trying to get a last update date (timestamp for change) for name and email columns and the corresponding values from these fields within IDs. If the given attribute has not changed, the minimun LAST_UPD should be received. I tried this and got the values I would like to have but how to "squeeze" this for given ID?

SELECT 
ID,
NAME,
CASE 
  WHEN LAG(NAME)OVER(PARTITION BY ID ORDER BY LAST_UPD) != NAME 
  THEN LAST_UPD 
  WHEN LEAD(NAME)OVER(PARTITION BY ID ORDER BY LAST_UPD) = NAME
  THEN LAST_UPD
END NAME_CHANGED,
EMAIL, 
CASE 
  WHEN LAG(EMAIL)OVER(PARTITION BY ID ORDER BY LAST_UPD) != EMAIL 
  THEN LAST_UPD 
  WHEN LEAD(EMAIL)OVER(PARTITION BY ID ORDER BY LAST_UPD) = EMAIL
  THEN LAST_UPD
END EMAIL_CHANGED
FROM CUSTOMER
;

The result should be this:

+----+------------------+---------------------+----------------------+---------------------+
| ID | NAME             | NAME_CHANGED        | EMAIL                | EMAIL_CHANGED       |
+----+------------------+---------------------+----------------------+---------------------+
| 1  | JOHN JAMES SMITH | 29/04/2017 10:52:11 | JOHN.SMITH@GMAIL.COM | 29/04/2017 10:50:51 |
+----+------------------+---------------------+----------------------+---------------------+
| 2  | JAMES FORD       | 29/04/2017 10:52:57 | J.FORD@GMAIL.COM     | 29/04/2017 10:53:17 |
+----+------------------+---------------------+----------------------+---------------------+

DDL:

CREATE TABLE CUSTOMER 
(
  ID VARCHAR2(20) 
, NAME VARCHAR2(50) 
, EMAIL VARCHAR2(50) 
, LAST_UPD DATE 
);

REM INSERTING into CUSTOMER
SET DEFINE OFF;
Insert into CUSTOMER (ID,NAME,EMAIL,LAST_UPD) values ('1','JOHN SMITH','JOHN.SMITH@GMAIL.COM',to_date('29/04/2017 10:50:51','DD/MM/YYYY HH24:MI:SS'));
Insert into CUSTOMER (ID,NAME,EMAIL,LAST_UPD) values ('1','J SMITH','JOHN.SMITH@GMAIL.COM',to_date('29/04/2017 10:51:15','DD/MM/YYYY HH24:MI:SS'));
Insert into CUSTOMER (ID,NAME,EMAIL,LAST_UPD) values ('1','J SMITH','JOHN.SMITH@GMAIL.COM',to_date('29/04/2017 10:51:36','DD/MM/YYYY HH24:MI:SS'));
Insert into CUSTOMER (ID,NAME,EMAIL,LAST_UPD) values ('1','JOHN JAMES SMITH','JOHN.SMITH@GMAIL.COM',to_date('29/04/2017 10:52:11','DD/MM/YYYY HH24:MI:SS'));
Insert into CUSTOMER (ID,NAME,EMAIL,LAST_UPD) values ('2','JAMES FORD','JAMES.FORD@GMAIL.COM',to_date('29/04/2017 10:52:57','DD/MM/YYYY HH24:MI:SS'));
Insert into CUSTOMER (ID,NAME,EMAIL,LAST_UPD) values ('2','JAMES FORD','J.FORD@GMAIL.COM',to_date('29/04/2017 10:53:17','DD/MM/YYYY HH24:MI:SS'));
Insert into CUSTOMER (ID,NAME,EMAIL,LAST_UPD) values ('2','JAMES FORD','J.FORD@GMAIL.COM',to_date('29/04/2017 11:47:15','DD/MM/YYYY HH24:MI:SS'));
COMMIT;

SELECT * FROM CUSTOMER;
jrara
  • 16,239
  • 33
  • 89
  • 120

4 Answers4

2

I think the key idea is to start with a flag that indicates whether an email or name changed. You can get this using lag. And, with the right logic, you will even mark the first record as a change.

Then, you want the last record for each of the columns that is marked as a change. The following code does this using the first_value() function -- because it can ignore null values:

select distinct id,
       first_value((case when name_changed then name end) ignore nulls) over (partition by id order by last_upd desc),
       max(case when name_changed then last_upd end) over (partition by id) as last_upd_name,
       first_value((case when email_changed then email end) ignore nulls) over (partition by id order by last_upd desc),
       max(case when email_changed then last_upd end) over (partition by id) as last_upd_email
from (select c.*,
             (case when c.name = lag(c.name) over (partition by c.id over order by c.last_upd) as name_changed,
                   then 0 else 1
              end) as name_changed,
             (case when c.email = lag(c.email) over (partition by c.id over order by c.last_upd) as email_change
                   then 0 else 1
              end) as email_changed
      from customer c
     ) c;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks, this does not seem to work straight with Oracle database, trying to figure out how to change the syntax. – jrara Apr 29 '17 at 14:38
  • @jrara . . . This is intended for Oracle. This is probably some simple syntax error like mismatched parentheses or a missing comma. – Gordon Linoff Apr 29 '17 at 15:34
1
select id, name, max(nc) name_changed, email, max(mc) email_changed
  from (
   select id,
          first_value(name) over(partition by id order by last_upd desc) name,
          case lead(name) over(partition by id order by last_upd desc)
               when name then NULL else last_upd end nc,
          first_value(email) over(partition by id order by last_upd desc) email,
          case lead(email) over(partition by id order by last_upd desc)
               when email then NULL else last_upd end mc
     from CUSTOMER
  )
 group by id,name,email
Mike
  • 1,985
  • 1
  • 8
  • 14
  • Thanks, this works also. It is slightly slower on my original data set (5,5M rows) that the solution @Gordon Linoff provied. – jrara Apr 30 '17 at 09:49
  • @jrara Slightly corrected. -1 window sort. compare the speed, if not difficult. How many resulting rows are in the sample on your data ? – Mike Apr 30 '17 at 10:06
0
with data as
(
  select ROWNUM AS RN, I.* 
  from 
  (
    select id,COL, VAL, LAST_UPD from customer
    unpivot(val for (col) in (NAME, EMAIL)) order by id, col, last_upd
  ) I
)
,
cte (rn, id, col, val, last_upd, dummy) as
(
  select rn, id, col, val, last_upd, 1 
  from data 
  where rn in (select rn from (select rn, min(rn) over (partition by id, col) m from data) where rn = m)
  union all
  select 
    data.rn, data.id, data.col, 
    case when cte.val = data.val then cte.val else data.val end, 
    case when cte.val = data.val then cte.last_upd else data.last_upd end, 
    cte.dummy+1 
  from 
    data, cte 
  where 
    cte.rn + 1 = data.rn and cte.col = data.col and cte.id = data.id
)
,
rs as
(
  select * from (
  select cte.*, max(dummy) over (partition by id, col) m from cte
  order by rn, id, col) where dummy = m
)
SELECT 
  n.ID, n.val as NAME, n.last_upd as NAME_CHANGED, 
  m.VAL as EMAIL, m.lAST_UPD as EMAIL_CHANGED
FROM
  (select * from rs where col = 'NAME') n 
  join
  (select * from rs where col = 'EMAIL') m
  on (n.id = m.id)
;
Slkrasnodar
  • 824
  • 6
  • 10
0

Based on the answer of @Gordon Linoff, modified for Oracle database, this works:

WITH CUST AS (
SELECT ID,
      NAME,
      EMAIL,
      LAST_UPD,
      CASE WHEN NAME = LAG(NAME) OVER (PARTITION BY ID ORDER BY LAST_UPD) THEN 0 ELSE 1 END AS NAME_CHANGED,
      CASE WHEN EMAIL = LAG(EMAIL) OVER (PARTITION BY ID ORDER BY LAST_UPD) THEN 0 ELSE 1 END AS EMAIL_CHANGED
FROM CUSTOMER
)
SELECT DISTINCT CUST.ID,
       FIRST_VALUE(CASE WHEN NAME_CHANGED = 1 THEN CUST.NAME END) IGNORE NULLS OVER (PARTITION BY ID ORDER BY LAST_UPD DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS NAME,
       MAX(CASE WHEN CUST.NAME_CHANGED = 1 THEN CUST.LAST_UPD END) OVER (PARTITION BY CUST.ID) AS LAST_UPD_NAME,
       FIRST_VALUE(CASE WHEN EMAIL_CHANGED = 1 THEN EMAIL END) IGNORE NULLS OVER (PARTITION BY ID ORDER BY LAST_UPD DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS EMAIL,
       MAX(CASE WHEN CUST.EMAIL_CHANGED  = 1 THEN CUST.LAST_UPD END) OVER (PARTITION BY CUST.ID) AS LAST_UPD_EMAIL
FROM CUST
ORDER BY CUST.ID
;      

Resulting:

+----+------------------+---------------------+----------------------+---------------------+
| ID | NAME             | NAME_CHANGED        | EMAIL                | EMAIL_CHANGED       |
+----+------------------+---------------------+----------------------+---------------------+
| 1  | JOHN JAMES SMITH | 29/04/2017 10:52:11 | JOHN.SMITH@GMAIL.COM | 29/04/2017 10:50:51 |
+----+------------------+---------------------+----------------------+---------------------+
| 2  | JAMES FORD       | 29/04/2017 10:52:57 | J.FORD@GMAIL.COM     | 29/04/2017 10:53:17 |
+----+------------------+---------------------+----------------------+---------------------+
jrara
  • 16,239
  • 33
  • 89
  • 120