-2

Sorry if this has already been asked. I see a lot of similar questions but none exactly like this one.
I am trying to de-dup a large set (about 500 M) records:

Sample data:

CUST_ID  PROD_TYPE  VALUE  DATE
------------------------------------
1        1          Y      5/1/2015 *
1        2          N      5/1/2015 *
1        1          N      5/2/2015 *
1        2          N      5/2/2015 
1        1          Y      5/3/2015 *
1        2          Y      5/3/2015 *
1        1          Y      5/6/2015 
1        2          N      5/6/2015 *

By CUST_ID and PROD_TYPE, I need to retain the initial records as well as any records having a changed VALUE (the records with the asterisks). There can sometimes be gaps between the dates. There are around 5M unique CUST_ID's.

Any help would be greatly appreciated.

Roman Marusyk
  • 23,328
  • 24
  • 73
  • 116
SQLGal
  • 3
  • 1

2 Answers2

0

Not sure why LAG isn't working for you, this returns your results:

with t as (
     select 1 as CUST_ID, 1 as PROD_TYPE, 'Y' as VALUE, '5/1/2015' as [Date]
    union
    select 1, 2, 'N', '5/1/2015'
    union
    select 1, 1, 'N', '5/2/2015'
    union
    select 1, 2, 'N', '5/2/2015'
    union
    select 1,1, 'Y', '5/3/2015'
    union 
    select 1, 2, 'Y','5/3/2015'
    union
    select 1,1, 'Y', '5/6/2015'
    union
    select 1, 2,'N','5/6/2015')

select 
    *,
    case when 
        value <>
        isnull(lag(value) over (partition by cust_id, prod_type order by [date]),'') 
        then 1 else 0 
    end as keep 
from
    t
order by
    [date],
    cust_id,
    prod_type
Kyle Hale
  • 7,912
  • 1
  • 37
  • 58
0

Thanks Kyle, that is exactly correct, and I was able to use that as a solution to my problem. The issue I was having (not being familiar with lag) was that I had failed to provide a default, so the gap in dates was creating a NULL value which was giving me problems, but once I provided that, it worked like a charm. Thanks!

SQLGal
  • 3
  • 1