1

I have an interesting grouping case for which I really can't figure the best way to proceed with. I'm dealing with a lot of data so I'm trying to find the most effective way to perform my query.

Here's an example of data I could have

|Line      |Trip          |Speed        |Timestamp
====================================================
|100       |1             |50 kmh       |2017-06-12 22:34:50
|100       |1             |55 kmh       |2017-06-12 22:36:44
|100       |1             |56 kmh       |2017-06-12 22:37:12
|200       |5             |12 kmh       |2017-06-12 22:40:11
|200       |5             |18 kmh       |2017-06-12 22:43:13
|100       |1             |23 kmh       |2017-06-12 22:49:11
|100       |1             |45 kmh       |2017-06-12 22:53:49

I would like to be able to define a grouping number sequentially based on the Line & Trip ordered by the timestamp. So basically at the end I would expect something such as:

|Line      |Trip          |Speed        |Timestamp            |GroupNumber
===========================================================================
|100       |1             |50 kmh       |2017-06-12 22:34:50  |  1
|100       |1             |55 kmh       |2017-06-12 22:36:44  |  1
|100       |1             |56 kmh       |2017-06-12 22:37:12  |  1
|200       |5             |12 kmh       |2017-06-12 22:40:11  |  2
|200       |5             |18 kmh       |2017-06-12 22:43:13  |  2
|100       |1             |23 kmh       |2017-06-12 22:49:11  |  3
|100       |1             |45 kmh       |2017-06-12 22:53:49  |  3

I can't find any way using DENSE_RANK or ROW_NUMBER because the last group of 100/1 is merged again with the first records but should not because there is another group (200/5) between both occurence.

Any help would be appreciated. Thank you.

Chris
  • 13
  • 2

1 Answers1

3

This is known as a Gaps and Islands problem.

To identify your islands, you need to use two ranking functions, the first to find the position of a row within the set, and the second to find the position of the row within its subset (Line, Trip), So you would end up with:

SELECT  Line,
        Trip,
        Speed,
        Timestamp,
        R1 = ROW_NUMBER() OVER(ORDER BY Timestamp),
        R2 = ROW_NUMBER() OVER(PARTITION BY Line, Trip ORDER BY Timestamp)
FROM    dbo.YourTable;


|Line      |Trip          |Speed        |Timestamp              R1      R2
==========================================================================
|100       |1             |50 kmh       |2017-06-12 22:34:50    1       1
|100       |1             |55 kmh       |2017-06-12 22:36:44    2       2
|100       |1             |56 kmh       |2017-06-12 22:37:12    3       3
|200       |5             |12 kmh       |2017-06-12 22:40:11    4       1   <-- starts new sequence for new group
|200       |5             |18 kmh       |2017-06-12 22:43:13    5       2
|100       |1             |23 kmh       |2017-06-12 22:49:11    6       4   <-- follows on from where it left off
|100       |1             |45 kmh       |2017-06-12 22:53:49    7       5

Now, if you deduct one from the other you get a unique identifier for each island

|Line      |Trip          |Speed        |Timestamp              R1      R2  (R1 - R2)
=======================================================================================
|100       |1             |50 kmh       |2017-06-12 22:34:50    1       1       0
|100       |1             |55 kmh       |2017-06-12 22:36:44    2       2       0
|100       |1             |56 kmh       |2017-06-12 22:37:12    3       3       0
|200       |5             |12 kmh       |2017-06-12 22:40:11    4       1       3
|200       |5             |18 kmh       |2017-06-12 22:43:13    5       2       3
|100       |1             |23 kmh       |2017-06-12 22:49:11    6       4       2   
|100       |1             |45 kmh       |2017-06-12 22:53:49    7       5       2

Finally, you can use this unique identifier to get a starting time for each group:

SELECT  Line,
        Trip,
        Speed,
        Timestamp,
        GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID)
FROM    (   SELECT  Line,
                    Trip,
                    Speed,
                    Timestamp,
                    IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
                                ROW_NUMBER() OVER(PARTITION BY Line, Trip 
                                                    ORDER BY Timestamp)
            FROM    dbo.YourTable
        ) AS t;

Then finally, you can apply DENSE_RANK() to the group start time to get an integer ranking. So with your sample data you would get:

-- SAMPLE DATA
DECLARE @T TABLE (Line INT, Trip INT, Speed VARCHAR(6), Timestamp DATETIME);
INSERT @T (Line, Trip, Speed, Timestamp)
VALUES
    (100, 1, '50 kmh', '2017-06-12 22:34:50'),
    (100, 1, '55 kmh', '2017-06-12 22:36:44'),
    (100, 1, '56 kmh', '2017-06-12 22:37:12'),
    (200, 5, '12 kmh', '2017-06-12 22:40:11'),
    (200, 5, '18 kmh', '2017-06-12 22:43:13'),
    (100, 1, '23 kmh', '2017-06-12 22:49:11'),
    (100, 1, '45 kmh', '2017-06-12 22:53:49');

WITH GroupedData AS
(   SELECT  Line,
            Trip,
            Speed,
            Timestamp,
            GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID),
            IslandID
    FROM    (   SELECT  Line,
                        Trip,
                        Speed,
                        Timestamp,
                        IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
                                    ROW_NUMBER() OVER(PARTITION BY Line, Trip 
                                                        ORDER BY Timestamp)
                FROM    @T
            ) AS t
) 
SELECT  Line,
        Trip,
        Speed,
        Timestamp,
        GroupNumber = DENSE_RANK() OVER(ORDER BY GroupStart, IslandID)
FROM    GroupedData
ORDER BY Timestamp;

OUTPUT

Line    Trip    Speed   Timestamp                   GroupNumber
--------------------------------------------------------------
100     1       50 kmh  2017-06-12 22:34:50.000     1
100     1       55 kmh  2017-06-12 22:36:44.000     1
100     1       56 kmh  2017-06-12 22:37:12.000     1
200     5       12 kmh  2017-06-12 22:40:11.000     2
200     5       18 kmh  2017-06-12 22:43:13.000     2
100     1       23 kmh  2017-06-12 22:49:11.000     3
100     1       45 kmh  2017-06-12 22:53:49.000     3
GarethD
  • 68,045
  • 10
  • 83
  • 123
  • 1
    (R1-R2) for the first three rows is 0 not 1 (but obviously this doesn't change the correctness of answer) – etsa Jul 05 '17 at 12:43
  • @etsa Thanks. I have corrected that and hung my head in shame at such a basic arithmetic error – GarethD Jul 05 '17 at 12:49
  • This is perfect. Actually, I don't need the extra portion with DENSE_RANK, your initial ROW_NUMBER minus ROW_NUMBER perfectly identify my groups and that's all I need. Thank you very much! – Chris Jul 05 '17 at 12:49