5

I am given the following table with the following problem:

Create a Slowly Changing Dimension Type 2 from the dataset. EMPLOYEE table has daily records for each employee. Type 2 - Will have effective data and expire date.

Employee ID Date Name Manager ID
123 1-Mar John Smith 1
123 2-Mar John Smith 1
123 3-Mar John Smith 2
123 4-Mar John Smith 3
123 5-Mar John Smith 3

I believe my target table is supposed to look like this:

Employee ID Name Manager ID Effective Date Expiration Date
123 John Smith 1 1-Mar 3-Mar
123 John Smith 2 3-Mar 4-Mar
123 John Smith 3 4-Mar Null

I attempted the following query:

SELECT employee_id, name, manager_id,
CASE
    WHEN LAG(manager_id) OVER() != manager_id THEN e.date 
    WHEN e.date = FIRST_VALUE(e.date) OVER() THEN e.date
    ELSE NULL
END as "Effective Date",
CASE 
    WHEN LEAD(manager_id) OVER() != manager_id THEN LEAD(e.date) OVER()
    ELSE NULL
END as "Expiration Date"
FROM employee e

My resulting table is as follows:

Employee ID Name Manager ID Effective Date Expiration Date
123 John Smith 1 1-Mar Null
123 John Smith 1 Null 3-Mar
123 John Smith 2 3-Mar 4-Mar
123 John Smith 3 4-Mar Null
123 John Smith 3 Null Null

Does anyone know of any way that I can alter my query to achieve my target table, based on what I've achieved thus far? I somehow need to only result in the 3 Manager ID's but distinct will not work. Also, I need to find a way to combine the effective date and expiration date for each manager ID. Any help at all would be greatly appreciated.

Dale K
  • 25,246
  • 15
  • 42
  • 71
LukeThomas
  • 75
  • 1
  • 1
  • 6
  • I apologize, I will quickly try and find out how to do so and correct my question. Thank you for letting me know. – LukeThomas Mar 22 '21 at 02:15
  • 1
    If you add your sample as DDL+DML you make it much easier for people to answer. – Dale K Mar 22 '21 at 02:38
  • And could your clarify, could the manager ever go back to a previous manager i.e. can the same manager be repeated at different points in time? – Dale K Mar 22 '21 at 02:40
  • I will add the sample. This was the only information I was given, so I am assuming that the employee can not go back to the same manager/can't be repeated at different points in time. And no I am not storing my dates this way. I just manually typed them in tabular form to have consistency for the question. – LukeThomas Mar 22 '21 at 02:47
  • 1
    Looks like what you need is a `GROUP BY` query, with `MIN()` and `MAX()`on `Date` column as the `Effective` and `Expiry`. `LEAD()` and `LAG()` are the wrong approach to the question – Squirrel Mar 22 '21 at 02:49
  • The MIN/MAX and GROUP BY gave me my solution. Thank you very much for your insight. I definitely overthought this one. – LukeThomas Mar 22 '21 at 04:12
  • Where is the ascending dimension key. – thebluephantom Jun 17 '22 at 21:56

5 Answers5

4

The following does what you require, and shows how to add DDL+DML as well. Its probably a bit convoluted but I can't see an obvious way to simplify it.

This solution takes into account the possibility that the manager could repeat. And it doesn't assume that ever day will exist, so if a day is missing it will still work.

declare @Test table (EmployeeID int, [Date] date, [Name] varchar(32), ManagerID int);

insert into @Test (EmployeeID, [Date], [Name], ManagerID)
values
(123, '1 Mar 2021', 'John Smith', 1),
(123, '2 Mar 2021', 'John Smith', 1),
(123, '3 Mar 2021', 'John Smith', 2),
(123, '4 Mar 2021', 'John Smith', 3),
(123, '5 Mar 2021', 'John Smith', 3);
--(123, '6 Mar 2021', 'John Smith', 2);

select EmployeeId, [Name], ManagerId, MinDate
  -- Use lead to get the last date of the next grouping - since it could in theory be more than one day on
  , lead(MinDate) over (partition by EmployeeId, [Name] order by Grouped) MaxDate
from (
  -- Get the min and max dates for a given grouping
  select EmployeeId, [Name], ManagerId, min([Date]) MinDate, max([Date]) MaxDate, Grouped
  from (
    select *
       -- Sum the change in manager to ensure that if a manager is repeated they form a different group
       , sum(Lagged) over (order by Date asc) Grouped
    from (
      select *
        -- Lag the manager to detect when it changes
        , case when lag(ManagerId,1,-1) over (order by [Date] asc) <> ManagerId then 1 else 0 end Lagged
      from @Test
    ) X
  ) Y
  group by EmployeeId, [Name], ManagerId, Grouped
) Z
order by EmployeeId, [Name], Grouped;

Returns:

EmployeeId Name ManagerId MinDate MaxDate
123 John Smith 1 2021-03-01 2021-03-03
123 John Smith 2 2021-03-03 2021-03-04
123 John Smith 3 2021-03-04 NULL
Dale K
  • 25,246
  • 15
  • 42
  • 71
  • SELECT EmployeeID, Name, ManagerID, [Date] StartDate, Lead([Date]) OVER (ORDER BY [Date]) EndDate FROM (SELECT *, lag(ManagerID,1,-1) OVER (ORDER BY [Date]) p_mgid FROM #temp ) s WHERE ManagerID <>p_mgid; ; – Sid May 29 '22 at 02:52
1

Use this, this will be simpler.

Explanation: The nested query will give the rows where there is change in managers, filter out rest of the rows as it is redundant information.

Once the data is filtered, find the next date when the manager got changed, mark that data as end date

SELECT 
   EmployeeID, 
   Name, 
   ManagerID, 
   [Date] StartDate, 
   Lead([Date]) OVER (ORDER BY [Date])  EndDate  
FROM 
    (SELECT *, lag(ManagerID,1,-1) OVER (ORDER BY [Date]) p_mgid FROM #temp ) s  
WHERE ManagerID <>p_mgid;
    ;

enter image description here

Sid
  • 118
  • 2
  • 10
0
create table test  (EmployeeID int, [Date] date, [Name] varchar(32), ManagerID int);

insert into Test (EmployeeID, [Date], [Name], ManagerID)
values
(123, '1 Mar 2021', 'John Smith', 1),
(123, '2 Mar 2021', 'John Smith', 1),
(123, '3 Mar 2021', 'John Smith', 2),
(123, '4 Mar 2021', 'John Smith', 3),
(123, '5 Mar 2021', 'John Smith', 3)

select a.employeeid,a.name,a.managerid,a.effective as effective_date,
lead(effective) over(partition by employeeid order by maximum) as expiration_date 
from (select employeeid,min(date) as effective,max(date)as maximum,name,managerid from test
group by employeeid,name,managerid) a
Suraj Rao
  • 29,388
  • 11
  • 94
  • 103
Federer-57
  • 11
  • 2
0
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('test').getOrCreate()

l1=[[123,'John Smith','1-March-2022',1],[123,'John Smith','2-March-2022',1],[123,'John Smith','3-March-2022',2],[123,'John Smith','4-March-2022',3],[123,'John Smith','5-March-2022',3]]
col=['empid','name','date','mgr']

df=spark.createDataFrame(l1,col)

df.createOrReplaceTempView('tempview')

spark.sql("""select empid,name,mgr,effective_from, lead(effective_from) over (partition by empid order by effective_from ) as effective_to from  
                (select empid,name,mgr ,min(date) as effective_from,max(date) as effective_to from tempview group by mgr,empid,name)""").show()
Dale K
  • 25,246
  • 15
  • 42
  • 71
-1
declare @Test table (EmployeeID int, [Date] date, [Name] varchar(32), ManagerID int);

insert into @Test (EmployeeID, [Date], [Name], ManagerID)
values
(123, '1 Mar 2021', 'John Smith', 1),
(123, '2 Mar 2021', 'John Smith', 1),
(123, '3 Mar 2021', 'John Smith', 2),
(123, '4 Mar 2021', 'John Smith', 3),
(123, '5 Mar 2021', 'John Smith', 3);
--(123, '6 Mar 2021', 'John Smith', 2);


;WITH CTE AS (
SELECT *,ROW_NUMBER() OVER (PARTITION BY NAME,MANAGERID ORDER BY DATE) RW FROM @TEST
)
SELECT EmployeeID,NAME,ManagerID,DATE AS FROMDATE ,(SELECT DATE FROM CTE B WHERE A.RW =B.RW AND B.ManagerID = A.ManagerID +1) AS ENDDATE
FROM CTE A 
WHERE RW=1
Jeremy Caney
  • 7,102
  • 69
  • 48
  • 77
  • While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – snakecharmerb Apr 13 '22 at 05:25