Aggregating neighboring rows with partitioning

Question

I have a huge data set on MS SQL 2012 where a special aggregation must be done. Here is an example of dataset.

Key PartitionID StartTime                   Duration    Name
1   1           23/05/2019 18:18:28.125     1           X   
2   1           23/05/2019 18:18:28.480     2           Y   
3   1           23/05/2019 18:18:29.622     1           X   
4   1           23/05/2019 18:18:32.513     2           X   
5   2           23/05/2019 18:21:13.973     3           X   
6   2           23/05/2019 18:21:14.945     4           X   
7   2           23/05/2019 18:21:21.949     5           X   
8   2           23/05/2019 18:21:30.871     2           X   
9   2           23/05/2019 18:21:35.710     4           X   
10  2           23/05/2019 18:21:48.550     1           X   
11  2           23/05/2019 18:22:00.144     3           X   
12  2           23/05/2019 18:22:01.094     6           X   
13  2           23/05/2019 18:22:03.354     1           X   
14  3           23/05/2019 18:24:44.219     6           X   
15  3           23/05/2019 18:24:46.076     1           Y   
16  3           23/05/2019 18:24:52.399     4           X   
17  3           23/05/2019 18:25:03.620     6           X   
18  3           23/05/2019 18:25:11.208     1           X   
19  3           23/05/2019 18:25:12.616     4           X   
20  3           23/05/2019 18:25:28.019     6           X   
21  3           23/05/2019 18:25:31.384     2           Y   
21  3           23/05/2019 18:25:32.334     2           Y   
21  3           23/05/2019 18:25:33.344     2           X

I have to create new column that is partitioning the data into sets based on Name, the CalculatedID must be different for the same Name when separated by a different Name. In other words if neighboring rows have the same Name then they also have the same CalculatedId.

The result should be similar to this:

Key PartitionID StartTime                   Duration    Name    CalculatedID
1   1           23/05/2019 18:18:28.125     1           X       1
2   1           23/05/2019 18:18:28.480     2           Y       2
3   1           23/05/2019 18:18:29.622     1           X       3
4   1           23/05/2019 18:18:32.513     2           X       3
5   2           23/05/2019 18:21:13.973     3           X       1
6   2           23/05/2019 18:21:14.945     4           X       1
7   2           23/05/2019 18:21:21.949     5           X       1
8   2           23/05/2019 18:21:30.871     2           X       1
9   2           23/05/2019 18:21:35.710     4           X       1
10  2           23/05/2019 18:21:48.550     1           X       1
11  2           23/05/2019 18:22:00.144     3           X       1
12  2           23/05/2019 18:22:01.094     6           X       1
13  2           23/05/2019 18:22:03.354     1           X       1
14  3           23/05/2019 18:24:44.219     6           X       1
15  3           23/05/2019 18:24:46.076     1           Y       2
16  3           23/05/2019 18:24:52.399     4           X       3
17  3           23/05/2019 18:25:03.620     6           X       3
18  3           23/05/2019 18:25:11.208     1           X       3
19  3           23/05/2019 18:25:12.616     4           X       3
20  3           23/05/2019 18:25:28.019     6           X       3
21  3           23/05/2019 18:25:31.384     2           Y       4
21  3           23/05/2019 18:25:32.334     2           Y       4
21  3           23/05/2019 18:25:33.344     2           X       5

I would really want to avoid looping through the data as the sets are easily over 10M.

This is a classic [tag:gaps-and-islands] problem. a good read on the subject is Itzik Ben-Gan's [Gaps and islands](https://livebook.manning.com/#!/book/sql-server-mvp-deep-dives/chapter-5) from SQL Server MVP Deep Dives — Zohar Peled, Jun 16 '19 at 10:15
Your description do not correspond to your expected output. Take three rows from start. Neighbor rows from Y have the same names, still they have a different CIDs. — user14063792468, Jun 16 '19 at 10:41
@ЯрославМашко I think that's because they have different partitionId. Note that each partitionId has it's own numbering. — Zohar Peled, Jun 16 '19 at 10:46
@ЯрославМашко as Zohar has pointed out, each partitionId represents new partition where numbering starts from scratch. — NegativePrizeWinner, Jun 16 '19 at 11:11

score 3 · Accepted Answer · answered Jun 16 '19 at 10:40

This can be done using a common table expression with lag to get the previous value for Name for each raw based on the values of PartitionId and StartTime, and then use sum as a window function to get a comulative sum of the rows where the previous name is different then the current name.

First, create and populate sample table (Please save us this step in your future questions):

DECLARE @T AS TABLE
(
    [Key] int,
    PartitionID int,
    StartTime datetime,
    Duration int,   
    Name char(1)
)

INSERT INTO @T ([Key] ,PartitionID, StartTime, Duration, Name) VALUES
(1 , 1, '2019-05-23T18:18:28.125', 1, 'X'),   
(2 , 1, '2019-05-23T18:18:28.480', 2, 'Y'),   
(3 , 1, '2019-05-23T18:18:29.622', 1, 'X'),   
(4 , 1, '2019-05-23T18:18:32.513', 2, 'X'),   
(5 , 2, '2019-05-23T18:21:13.973', 3, 'X'),   
(6 , 2, '2019-05-23T18:21:14.945', 4, 'X'),   
(7 , 2, '2019-05-23T18:21:21.949', 5, 'X'),   
(8 , 2, '2019-05-23T18:21:30.871', 2, 'X'),   
(9 , 2, '2019-05-23T18:21:35.710', 4, 'X'),   
(10, 2, '2019-05-23T18:21:48.550', 1, 'X'),   
(11, 2, '2019-05-23T18:22:00.144', 3, 'X'),   
(12, 2, '2019-05-23T18:22:01.094', 6, 'X'),   
(13, 2, '2019-05-23T18:22:03.354', 1, 'X'),   
(14, 3, '2019-05-23T18:24:44.219', 6, 'X'),   
(15, 3, '2019-05-23T18:24:46.076', 1, 'Y'),   
(16, 3, '2019-05-23T18:24:52.399', 4, 'X'),   
(17, 3, '2019-05-23T18:25:03.620', 6, 'X'),   
(18, 3, '2019-05-23T18:25:11.208', 1, 'X'),   
(19, 3, '2019-05-23T18:25:12.616', 4, 'X'),   
(20, 3, '2019-05-23T18:25:28.019', 6, 'X'),   
(21, 3, '2019-05-23T18:25:31.384', 2, 'Y'),   
(21, 3, '2019-05-23T18:25:32.334', 2, 'Y'),   
(21, 3, '2019-05-23T18:25:33.344', 2, 'X')

The common table expression:

;WITH CTE AS
(
    SELECT  [Key] ,PartitionID, StartTime, Duration, Name,
            LAG(Name) OVER(PARTITION BY PartitionID ORDER BY StartTime) As PrevName
    FROM @T
)

The query:

SELECT  [Key] ,PartitionID, StartTime, Duration, Name, 
        SUM(IIF(Name = PrevName, 0, 1)) OVER(PARTITION BY PartitionID ORDER BY StartTime) As CalculatedId
FROM CTE
ORDER BY [Key]

Results:

Key PartitionID StartTime               Duration    Name    CalculatedId
1   1           23.05.2019 18:18:28     1           X       1
2   1           23.05.2019 18:18:28     2           Y       2
3   1           23.05.2019 18:18:29     1           X       3
4   1           23.05.2019 18:18:32     2           X       3
5   2           23.05.2019 18:21:13     3           X       1
6   2           23.05.2019 18:21:14     4           X       1
7   2           23.05.2019 18:21:21     5           X       1
8   2           23.05.2019 18:21:30     2           X       1
9   2           23.05.2019 18:21:35     4           X       1
10  2           23.05.2019 18:21:48     1           X       1
11  2           23.05.2019 18:22:00     3           X       1
12  2           23.05.2019 18:22:01     6           X       1
13  2           23.05.2019 18:22:03     1           X       1
14  3           23.05.2019 18:24:44     6           X       1
15  3           23.05.2019 18:24:46     1           Y       2
16  3           23.05.2019 18:24:52     4           X       3
17  3           23.05.2019 18:25:03     6           X       3
18  3           23.05.2019 18:25:11     1           X       3
19  3           23.05.2019 18:25:12     4           X       3
20  3           23.05.2019 18:25:28     6           X       3
21  3           23.05.2019 18:25:31     2           Y       4
21  3           23.05.2019 18:25:32     2           Y       4
21  3           23.05.2019 18:25:33     2           X       5

This is exactly what I searched for! Thanks for your comments and solution! — NegativePrizeWinner, Jun 16 '19 at 11:04
Glad to help :-). Please be sure to include proper sample data in your next questions. — Zohar Peled, Jun 16 '19 at 11:19

Aggregating neighboring rows with partitioning

1 Answers1