0

I have following data set for data lake which is acting as source for Dimension where I want to migrate the history data in Dimension

For e.g.: image

Primarykey       Checksum     DateFrom     Dateto      ActiveFlag 
  1                  11         01:00       03:00         False
  1                  22         03:00       05:00         False 
  1                  22         05:00       07:00         False
  1                  11         07:00       09:00         False
  1                  11         09:00    12/31/999         TRUE

Please note that datalake table have multiple columns which are not part of dimension so we are recalculating the checking show same value but datefrom and dateto

with base as (
Select 
   Primary_key,
   checksum,
   first_value ( datefrom ) over ( partition by Primary_key ,checksum order by datefrom ) as Datefrom,
   last_value ( dateto ) over ( partition by Primary_key  ,checksum order by datefrom ) as Dateto,
   rownumber () over ( partition by Primary_key  ,checksum order by datefrom ) as latest_record 
from Datalake.user)
select * from base where latest_record = 1

Data shown as

Primarykey       Checksum     DateFrom     Dateto 
   1              11           01:00         12/31/999 
   1              22           03:00         07:00

But Expected out is

Primarykey       Checksum     DateFrom     Dateto 
   1              11           01:00         03:00 
   1              22           03:00         07:00
   1              11           07:00         12/31/999 

I tried using multiple ways in single query but any good suggestions?

Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45

3 Answers3

0

The reason you are get only two rows is you have two columns in your partitions Primarykey and checksum and those only have two combinations. The line you are wanting in the Expected output has the same Primarykey and checksum (1,11) as the first row in your expected output.

The thing I see in your data that would get your result would be if you included ActiveFlag into your partitions.

WITH base AS (
    SELECT 
       primary_key,
       checksum,
       FIRST_VALUE (datefrom) OVER ( PARTITION BY primary_key, checksum, active_flag order by datefrom) AS datefrom,
       LAST_VALUE (dateto) OVER ( partition BY primary_key, checksum, active_flag order by datefrom) AS dateto,
       ROWNUMBER () OVER ( partition BY primary_key, checksum, active_flag order by datefrom) AS latest_record 
    FROM Datalake.user
)
SELECT * FROM base WHERE latest_record = 1
Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
  • This will not work as active flag can be false and next record with True with different PK inserted. – snowflakeuser Nov 22 '19 at 05:48
  • @snowflakeuser you cannot get the answer you want with the limited information. I didn't say it would work, for all the thing you didn't mention, but for the data shown that would have the effect. In the end of the day this is your problem to solve, we can just help teach you how the tools work, or show idea that seem to fit the task as you have described it. – Simeon Pilgrim Nov 23 '19 at 00:32
0

Try this code. Should work both in Snowflake and Oracle: Create a separate group if checksum changes order by datefrom

**SNOWFLAKE**:
WITH base AS (
SELECT 
Primarykey,
   checksum,
   FIRST_VALUE( datefrom ) OVER ( PARTITION BY Primarykey ,checksum,checksum_group     ORDER BY datefrom ) AS Datefrom,
   LAST_VALUE( dateto ) OVER ( PARTITION BY Primarykey  ,checksum,checksum_group     ORDER BY datefrom ) AS Dateto,
   ROW_NUMBER() over ( PARTITION BY Primarykey  ,checksum,checksum_group ORDER BY     datefrom ) AS latest_record 
FROM(   
SELECT 
Primarykey,
   checksum,
   checksum_prev,
   datefrom,
   dateto,
   LAST_VALUE((case when checksum<>checksum_prev THEN group1 END)) IGNORE NULLS OVER     (
  ORDER BY group1
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) checksum_group
 FROM (
SELECT 
   Primarykey,
   checksum,
   datefrom,
   dateto,
   LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
   LPAD(1000 + ROW_NUMBER() OVER(ORDER BY (SELECT NULL)), 4, 0) as group1
FROM Datalake.user)
)
) 
SELECT * FROM base WHERE latest_record = 1

**Oracle**:
WITH base AS (
SELECT 
Primarykey,
   checksum,
   FIRST_VALUE ( datefrom ) OVER ( partition by Primarykey ,checksum,checksum_group     order by datefrom ) AS Datefrom,
   LAST_VALUE ( dateto ) OVER ( partition by Primarykey  ,checksum,checksum_group     order by datefrom ) AS Dateto,
   ROW_NUMBER() OVER ( PARTITION BY Primarykey  ,checksum,checksum_group ORDER BY     datefrom ) AS latest_record 
FROM(   
SELECT 
Primarykey,
   checksum,
   checksum_prev,
   datefrom,
   dateto,
   LAST_VALUE((CASE WHEN checksum<>checksum_prev THEN group1 END)) IGNORE NULLS 
   OVER (ORDER BY group1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)     checksum_group
 FROM (
SELECT 
   Primarykey,
   checksum,
   datefrom,
   dateto,
   LAG(checksum, 1, 0) OVER (ORDER BY DATEFROM) AS checksum_prev,
   LPAD(1000 + ROWNUM, 4, 0) as group1
FROM Datalake.user))) 
SELECT * FROM base WHERE latest_record = 1
MKP
  • 176
  • 3
0

I tweaked the query so it could work on entire data set. Due to missing primary key , it was failing for entire data. Modified working query

enter image description here