0

I have an interesting situation in SQL Server 2016. I am using T-SQL language.

I have a dataset called (#dataset):

enter image description here

The last column called ContinuousDates will ALWAYS have continuous date values without a gap, say Jan 1, 2021 to Dec 31, 2021. It will NEVER have duplicate dates for the same ID or Name, i.e. one person on a given day can have only one row of data. (In this example, I am showing just one person, with ID = 1 and Name = X. In my actual data, I have multiple people).

Note that NYC city occurs earlier in the dataset, and gets repeated in the last 4 rows.

I need to obtain the below dataset based on date range:

enter image description here

I tried to use a simple MINIMUM and MAXIMUM on the dataset, but I realize that at times I can get a wrong output, as below:

enter image description here

I tried some options using RANK() and DENSE_RANK() functions, but am not able to come to a solution. Can someone provide me assistance ?

I have the codes attached here:

CREATE TABLE #dataset

(

ID int,
Name varchar(20),
City varchar(20),
ContinuousDates date

)


INSERT INTO #dataset
VALUES(1,'X','NYC','1/1/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/2/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/3/2021')

INSERT INTO #dataset
VALUES(1,'X','SFO','1/4/2021')

INSERT INTO #dataset
VALUES(1,'X','SFO','1/5/2021')

INSERT INTO #dataset
VALUES(1,'X','PHY','1/6/2021')

INSERT INTO #dataset
VALUES(1,'X','PHY','1/7/2021')

INSERT INTO #dataset
VALUES(1,'X','PHY','1/8/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/9/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/10/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/11/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/12/2021')




SELECT *
FROM #dataset
ORDER BY ContinuousDates

I have a new set of codes, for a better demonstration:

CREATE TABLE #dataset

(

ID int,
Name varchar(20),
City varchar(20),
ContinuousDates date

)


INSERT INTO #dataset
VALUES(1,'X','NYC','1/1/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/2/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/3/2021')

INSERT INTO #dataset
VALUES(1,'X','SFO','1/4/2021')

INSERT INTO #dataset
VALUES(1,'X','SFO','1/5/2021')

INSERT INTO #dataset
VALUES(1,'X','PHY','1/6/2021')

INSERT INTO #dataset
VALUES(1,'X','PHY','1/7/2021')

INSERT INTO #dataset
VALUES(1,'X','PHY','1/8/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/9/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/10/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/11/2021')

INSERT INTO #dataset
VALUES(1,'X','NYC','1/12/2021')

INSERT INTO #dataset
VALUES(2,'Y','MEL','1/13/2021')

INSERT INTO #dataset
VALUES(3,'Z','SYD','1/14/2021')

INSERT INTO #dataset
VALUES(3,'Z','SYD','1/15/2021')

INSERT INTO #dataset
VALUES(3,'Z','PER','1/16/2021')

INSERT INTO #dataset
VALUES(4,'A',NULL,'1/16/2021')

INSERT INTO #dataset
VALUES(4,'A', NULL,'1/17/2021')



SELECT *
FROM #dataset
ORDER BY ID, ContinuousDates
user3812887
  • 439
  • 12
  • 33

2 Answers2

2

This is a type of gaps-and-islands problem.

There are a number of different solutions. Here is one simple one

  • Use LAG to identify rows that start each island
  • A running conditional count gives us an ID for each island
  • Then simply group up by that ID (along with any other partition columns)
WITH StartPoints AS (
    SELECT *,
      IsStart = CASE WHEN LAG(City, 1, '') OVER (PARTITION BY ID ORDER BY ContinuousDates)
                         <> City THEN 1 END
    FROM #dataset ds
),
Groups AS (
    SELECT *,
      GroupId = COUNT(IsStart) OVER (PARTITION BY ID ORDER BY ContinuousDates ROWS UNBOUNDED PRECEDING)
    FROM StartPoints
)
SELECT
  ID,
  Name,
  City = MIN(City),
  DateStart = MIN(ContinuousDates),
  DateEnd = MAX(ContinuousDates)
FROM Groups
GROUP BY
  ID,
  Name,
  GroupId;

db<>fiddle

Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • thanks very much for your response; what if there are more than 1 column of the type 'City' ? In my original data, I have City, State, Manager, Department, etc. a good number of columns. And I have many people, not just one person called X. I have X,Y,Z,... Also, is there a way, I can develop this solution using ad-hoc query only, without the use of a CTE ? – user3812887 Nov 10 '21 at 20:19
  • CTE's are ad-hoc queries, not sure what you mean. You can move them into derived tables (subqueries) if it makes you feel better, but then you will have doubly nested tables. If you want to partition by all those columns, simply add them all to the `PARTITION BY` and `GROUP BY` clauses. So you would do `PARTITION BY ID, Name, City, State, Manager, Department ORDER BY ...` – Charlieface Nov 10 '21 at 20:29
  • Do I need to test other columns such as Country, Manager, Department too, using LAG, as below? LAG(City, 1, '') OVER (PARTITION BY ID ORDER BY ContinuousDates) <> City – user3812887 Nov 10 '21 at 20:31
  • Can you just modify IsStart column in the StartPoints CTE, if I have three columns: City, Department, Manager ? If you just provide an outline, I will absorb it. – user3812887 Nov 10 '21 at 20:34
  • Depends what you are trying to do. If you want to calculate the islands for each `City, State, Manager` group separately (ie as if they were each in their own table), then do as I said. If you want restart a new island for each change of those columns (so you are looking at them as an actual change) then you need to check a new `LAG` for each. I suggest you create a *new* question with your new data, rather than modifying this one. – Charlieface Nov 10 '21 at 20:34
  • I am trying to track multiple columns - City, Country, Manager, Department, etc. - all at once, over a period of time, for all the employees of my company . In my example above, I put only City, to keep it simple. So based on your reply, I need to add new LAG function for each of my columns, am I right? – user3812887 Nov 10 '21 at 20:39
  • Sounds like it yes. But hard to say without seeing sample data and expected output – Charlieface Nov 10 '21 at 20:40
  • LAG(City, 1, '') OVER (PARTITION BY ID ORDER BY ContinuousDates) <> City AND LAG(Country, 1, '') OVER (PARTITION BY ID ORDER BY ContinuousDates) <> Country AND LAG(Manager, 1, '') OVER (PARTITION BY ID ORDER BY ContinuousDates) <> Manager AND LAG(Department, 1, '') OVER (PARTITION BY ID ORDER BY ContinuousDates) <> Department – user3812887 Nov 10 '21 at 20:41
  • 1
    Change those `AND` to `OR` I think – Charlieface Nov 10 '21 at 20:42
  • I have added LAG function for multiple columns using AND operator. IS this right ? – user3812887 Nov 10 '21 at 20:42
  • thanks very much appreciate it; am inching closer using the OR operator. I have a dozen columns in my actual data, for than 1000 people. Once i am done, i will mark your post as solution – user3812887 Nov 10 '21 at 21:03
  • How are you? I used the other solution using the ROW NUMBER option posted by the other person last time. But, today I have a new client who has a situation in which ID = 1 (for Person X) can have 2 rows for Jan 1. Say, ID = 1, Name = 'X', City = 'Charlotte', Jan 1, 2021. This new row is on top of what is posted in the first image of this question. So, person X, with ID = 1, will have 2 cities (NYC and Charlotte) on Jan 1, 2021. The ROW NUMBER option is not working here. So am trying your approach with LAG function. Will your solution work ? – user3812887 Nov 25 '21 at 22:43
  • Again, it depends what you are trying to do: if you want a new island to restart on that change then you need another `LAG`. However if you want each separate `ID, City` pair to be calculated separately, and merge with the next rows for the same `ID, City` (as if it were a completely different set of data from other `ID, City` values) then you need to add it to the `PARTITION BY` and *not* use another `LAG` – Charlieface Nov 25 '21 at 23:48
  • thanks; after spending some time, i figured out just now, with some extra columns in the PARTITION BY clause, am getting close to the solution – user3812887 Nov 26 '21 at 00:31
  • I have a new question here. Would it be possible to look into it ?https://stackoverflow.com/questions/70161064/microsoft-sql-server-2016-t-sql-language-gaps-islands-problem-tricky-p – user3812887 Nov 29 '21 at 21:05
2

Solution steps:

  • numbers sections with ID and Name sorted by date (row_id)
  • numbers sections with ID, Name and City sorted by date (p_row_id)
  • calculate row_id - p_row_id

Now you have group numbers for each period within unique set of values.

All that you need is to group by this number, ID, Name and City

ID Name City ContinuousDates p_row_id row_id row_id - p_row_id
1 X NYC 2021-01-01 1 1 0
1 X NYC 2021-01-02 2 2 0
1 X NYC 2021-01-03 3 3 0
1 X SFO 2021-01-04 1 4 3
1 X SFO 2021-01-05 2 5 3
1 X PHY 2021-01-06 1 6 5
1 X PHY 2021-01-07 2 7 5
1 X PHY 2021-01-08 3 8 5
1 X NYC 2021-01-09 4 9 5
1 X NYC 2021-01-10 5 10 5
1 X NYC 2021-01-11 6 11 5
1 X NYC 2021-01-12 7 12 5
select
     CD.ID
    ,CD.[Name]
    ,CD.City
    ,min(CD.ContinuousDates) as DateStart
    ,max(CD.ContinuousDates) as DateEnd
from
    (
        select *
            ,row_number() over(partition by CD.ID, CD.[Name], CD.City order by CD.ContinuousDates) as p_row_id
            ,row_number() over(partition by CD.ID, CD.[Name] order by CD.ContinuousDates) as row_id
        from #dataset CD
    ) CD
group by CD.row_id - CD.p_row_id
        ,CD.ID
        ,CD.[Name]
        ,CD.City
order by DateStart

template for multiple column:

select
     CD.GroupColumn1
    ,CD.GroupColumn2
    ..
    ,CD.Column1
    ,CD.Column2
    ,CD.Column3
    ,CD.Column4
    ..
    ,min(CD.ContinuousDates) as DateStart
    ,max(CD.ContinuousDates) as DateEnd
from
    (
        select *
            ,row_number() over(partition by
                 CD.GroupColumn1
                ,CD.GroupColumn2
                ..
                ,CD.Column1
                ,CD.Column2
                ,CD.Column3
                ,CD.Column4
                ..
                order by CD.ContinuousDates) as p_row_id
            ,row_number() over(partition by
                 CD.GroupColumn1
                ,CD.GroupColumn2
                ..
                order by CD.ContinuousDates) as row_id
        from #dataset CD
    ) CD
group by CD.row_id - CD.p_row_id
        ,CD.GroupColumn1
        ,CD.GroupColumn2
        ..
         CD.Column1
        ,CD.Column2
        ,CD.Column3
        ,CD.Column4
        ..
order by DateStart
  • I have a problem when i use the code with many employees, say X,Y,Z,...... The difference value [CD.row_id - CD.p_row_id] is not coming properly. Should I add any extra column for Partition in the row_id column? (Your above solution works perfectly fine, if I try one employee at a time) – user3812887 Nov 11 '21 at 16:00
  • For multiple employees (ID = 1,2,3,4..), would this be right ? row_number() over(partition by CD.ID order by CD.ContinuousDates) as row_id – user3812887 Nov 11 '21 at 16:16
  • I just added query that you can use as a template for multiple column. any extra column should be added to "select", "group by" and "partition by" – Павел Сивоплясов Nov 11 '21 at 16:39
  • I got this part adding multiple columns; say I have an employee X, employee Y, employee Z, each with a unique ID, the minor change needed in the code above is with the row_id column: row_number() over(partition by CD.ID order by CD.ContinuousDates) as row_id. that is the column called row_id MUST have a partition by ID; (currently it only has an ORDER BY clause for CD.ContinuousDates). I just cheked this – user3812887 Nov 11 '21 at 17:11
  • everything else looks great; thanks very much – user3812887 Nov 11 '21 at 17:12
  • I have a lot of NULLs under many columns in my actual dataset, to let you know – user3812887 Nov 11 '21 at 17:20
  • actually, pratition for row_id can be any depending on the logic. only with one condition: it should be subclass of p_row_id (another words it shouldn't be wider than partition for p_row_id). "row_id - p_row_id" is simply a number that indicate where you haven't gaps within unique data of field set specified for p_row_id partition. – Павел Сивоплясов Nov 11 '21 at 17:22
  • I have again added a new set of INSERT statements; if you try your solution (without Partition By) for row_id column, you will get two rows for ID = 4 (i.e. Employee 'A') on Jan 16 and Jan 17; if you add Partition By clause for the ID column of my dataset in the row_id column, this will be corrected. Please try and let me know – user3812887 Nov 11 '21 at 20:27
  • 1
    I get it. That's my bad. You've mentioned that dates will never have duplicate for the same ID or Name. I don't know how I missed that. You also need to include Name to row_id partition if it isn't a one-to-one relationship. – Павел Сивоплясов Nov 11 '21 at 21:19
  • Name and ID are obviously 1-1 relationship, so no issue there; an ID uniquely identifies a person's name – user3812887 Nov 11 '21 at 21:24
  • In that case, ID will be enought. I'm so sorry. That was totally my fault. – Павел Сивоплясов Nov 11 '21 at 21:52
  • I have another client, where there are duplicates for days; one employee can have multiple rows for the same date. Can you help me, if I post a new question ? – user3812887 Nov 25 '21 at 22:26
  • I have a new question here. Would it be possible to look into it ?https://stackoverflow.com/questions/70161064/microsoft-sql-server-2016-t-sql-language-gaps-islands-problem-tricky-p – user3812887 Nov 29 '21 at 21:06