
I have data from which I am trying to identify patterns. However, the data in each table isn't complete (there are missing rows). I would like to split each table into chunks of complete data and then identify the patterns in each chunk. I have a column, called sequence, that I can use to determine whether the data is complete.

Data will look like:

Sequence      Position 
1              open
2              closed 
3              open
4              open
5              closed
8              closed
9              open
11             open
13             closed
14             open 
15             open
18             closed
19             open
20             closed

First I'd like to split the data into complete sections:

   Sequence      Position 
    1              open
    2              closed 
    3              open
    4              open
    5              closed
---------------------------
    8              closed
    9              open
---------------------------
    11             open
---------------------------
    13             closed
    14             open 
    15             open
---------------------------
    18             closed
    19             open
    20             closed

Then I'd like to identify the pattern closed, open, ..., open, closed: we go from closed to open for n rows (where n is at least 1) and then back to closed.

From the sample data this would leave:

     Sequence        Position 
        2              closed 
        3              open
        4              open
        5              closed
    ---------------------------
        18             closed
        19             open
        20             closed
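For reference, the chunk-then-match logic described above can be written out directly; a minimal Python sketch (the function name `complete_cycles` is illustrative, not from any library) that a SQL solution can be validated against:

```python
def complete_cycles(rows):
    """Keep only closed, open (n >= 1 times), closed runs that occur
    inside chunks of consecutive sequence numbers.

    rows: list of (sequence, position) tuples, sorted by sequence.
    """
    # Step 1: split into chunks wherever the sequence number jumps.
    chunks, current = [], []
    for seq, pos in rows:
        if current and seq != current[-1][0] + 1:
            chunks.append(current)
            current = []
        current.append((seq, pos))
    if current:
        chunks.append(current)

    # Step 2: within each chunk, emit every closed..open+..closed span.
    # Note a 'closed' row that ends one cycle also starts the next, so
    # it can appear twice when two cycles share a boundary.
    result = []
    for chunk in chunks:
        last_closed = None           # index of the most recent 'closed'
        for i, (seq, pos) in enumerate(chunk):
            if pos == 'closed':
                if last_closed is not None and i - last_closed > 1:
                    result.extend(chunk[last_closed:i + 1])
                last_closed = i
    return result


sample = [(1, 'open'), (2, 'closed'), (3, 'open'), (4, 'open'),
          (5, 'closed'), (8, 'closed'), (9, 'open'), (11, 'open'),
          (13, 'closed'), (14, 'open'), (15, 'open'), (18, 'closed'),
          (19, 'open'), (20, 'closed')]
print([s for s, _ in complete_cycles(sample)])  # [2, 3, 4, 5, 18, 19, 20]
```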

This gives me a final table that I can perform analysis on, as I know it contains no broken sequences. I also have another column where the position is encoded as binary, if that is easier to work with.

The tables are large, so although I think I could write loops to produce the result, I don't think that approach would be efficient enough. Alternatively, I could pull the whole table into R and compute the result there, but that requires loading everything into R first, so I'm wondering whether this is feasible in SQL.

EDIT: Different sample data that's more representative:

Sequence      Position 
    1              open
    2              closed 
    3              open
    4              open
    5              closed
    8              closed
    9              open
    11             open
    13             closed
    14             open 
    15             open
    18             closed
    19             open
    20             closed
    21             closed
    22             closed
    23             closed
    24             open
    25             open
    26             closed
    27             open

Note this should give the same results, but also include:

    23             closed
    24             open
    25             open
    26             closed

21, 22 and 27 are not included, as they don't fit the closed, open, ..., open, closed pattern.

BUT if we had 28 closed, we would want 27 and 28, as there is no time gap and the pattern fits. If instead of 28 it was 29 closed, we wouldn't want 27 or 29 (because although the pattern is right, the sequence breaks).

To add some context, think of a machine that goes from stopped, to running, to stopped. We record the data, but there are gaps in the recording, which here are represented by breaks in the sequence. As well as missing data in the middle of the stop-running-stop cycle, the data also sometimes starts recording when the machine is already running, or stops recording before the machine stops. I don't want that data, as it is not a complete cycle of stop, running, stop. I only want the complete cycles where the sequence was continuous. This means I can transform my original data set into one with only complete cycles, one after the other.

Olivia

2 Answers


I think there is actually a relatively simple way to look at this. You can identify the closing sequence number by:

  • Looking at the sequence of the previous close
  • Looking at the cumulative opens for the previous close and the current close
  • Doing arithmetic to be sure all the intermediates are in the data

This turns into a query:

select *
from (select t.*,
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
            from t
           ) t
     ) t
where position = 'closed' and
      (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
      sequence > prev_sequence + 1;

Now that you have identified the sequences, you can join back to get the original rows:

select t.*
from t join
     (select x.*
      from (select t.*,
                   lag(sequence) over (partition by position order by sequence) as prev_sequence,
                   lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
            from (select t.*,
                         sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
                  from t
                 ) t
           ) x
      where x.position = 'closed' and
            (x.cume_opens - x.prev_cume_opens) = x.sequence - x.prev_sequence - 1 and
            x.sequence > x.prev_sequence + 1
     ) seqs
     on t.sequence between seqs.prev_sequence and seqs.sequence;

I admit that I haven't tested this, but I do think the idea works. One caveat: it will choose multiple "closed" periods per sequence group.
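Since the query is untested, here is a minimal check of the same cumulative-opens idea, run against the second sample data through Python's stdlib `sqlite3` (assumes SQLite 3.25+ for window functions). The LAG/SUM results are computed in CTEs so the outer WHERE can filter on them; filtering the 'closed' rows first makes the partitioned LAG unnecessary:

```python
import sqlite3

# Sketch: verify the cumulative-opens approach on the sample data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (sequence INTEGER, position TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, 'open'), (2, 'closed'), (3, 'open'), (4, 'open'), (5, 'closed'),
    (8, 'closed'), (9, 'open'), (11, 'open'), (13, 'closed'), (14, 'open'),
    (15, 'open'), (18, 'closed'), (19, 'open'), (20, 'closed'),
    (21, 'closed'), (22, 'closed'), (23, 'closed'), (24, 'open'),
    (25, 'open'), (26, 'closed'), (27, 'open')])

rows = con.execute("""
    WITH base AS (
        -- running count of 'open' rows
        SELECT sequence, position,
               SUM(CASE WHEN position = 'open' THEN 1 ELSE 0 END)
                   OVER (ORDER BY sequence) AS cume_opens
        FROM t
    ),
    closes AS (
        -- each 'closed' row paired with the previous 'closed' row
        SELECT sequence, cume_opens,
               LAG(sequence)   OVER (ORDER BY sequence) AS prev_sequence,
               LAG(cume_opens) OVER (ORDER BY sequence) AS prev_cume_opens
        FROM base
        WHERE position = 'closed'
    )
    SELECT t.sequence, t.position
    FROM t
    JOIN closes c ON t.sequence BETWEEN c.prev_sequence AND c.sequence
    WHERE c.cume_opens - c.prev_cume_opens = c.sequence - c.prev_sequence - 1
      AND c.sequence > c.prev_sequence + 1  -- at least one 'open' in between
    ORDER BY t.sequence
""").fetchall()

print([s for s, _ in rows])  # [2, 3, 4, 5, 18, 19, 20, 23, 24, 25, 26]
```

As noted above, a 'closed' row that ends one cycle and immediately starts another would be returned once per cycle it belongs to; the sample data has no such case.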

Gordon Linoff

You can use this approach:

DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10))

INSERT INTO @MyTable
VALUES
(1,'open'),
(2,'closed') ,
(3,'open'),
(4,'open'),
(5,'closed'),
(8,'closed'),
(9,'open'),
(11,'open'),
(13,'closed'),
(14,'open') ,
(15,'open'),
(18,'closed'),
(19,'open'),
(20,'closed'),
(21,'closed'),
(22,'closed'),
(23,'closed'),
(24,'open'),
(25,'open'),
(26,'closed'),
(27,'open')


;WITH CTE AS(
    SELECT * ,
        CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark
    FROM @MyTable
)
,CTE_2 AS 
(
    SELECT 
        [New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING )) 
        , [Sequence]
        , Position
     FROM CTE
)
,CTE_3 AS (
    SELECT *, 
    RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence]) 
    FROM CTE_2
)
,CTE_4 AS
(
    SELECT ([New_Sequence] - RN) G
    , MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq
    , MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq
    FROM CTE_3 
    GROUP BY ([New_Sequence] - RN)
)
SELECT
    CTE.Sequence, CTE.Position
FROM CTE_4 
    INNER JOIN CTE  ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq)
WHERE
    CTE_4.MaxCloseSq > CTE_4.MinCloseSq
    AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL)

Result:

Sequence    Position
----------- ----------
2           closed
3           open
4           open
5           closed
---         ---
18          closed
19          open
20          closed
---         ---
23          closed
24          open
25          open
26          closed
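If you want to sanity-check the logic outside SQL Server, here is a sketch of the same CTE chain translated to SQLite and run from Python's stdlib `sqlite3` (assumes SQLite 3.25+ for window functions):

```python
import sqlite3

# Sketch: the CloseMark / shifted-sequence / gaps-and-islands chain,
# translated to SQLite so it can be checked from Python.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (sequence INTEGER, position TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, 'open'), (2, 'closed'), (3, 'open'), (4, 'open'), (5, 'closed'),
    (8, 'closed'), (9, 'open'), (11, 'open'), (13, 'closed'), (14, 'open'),
    (15, 'open'), (18, 'closed'), (19, 'open'), (20, 'closed'),
    (21, 'closed'), (22, 'closed'), (23, 'closed'), (24, 'open'),
    (25, 'open'), (26, 'closed'), (27, 'open')])

rows = con.execute("""
    WITH cte AS (
        -- flag a 'closed' row that directly follows another 'closed'
        SELECT sequence, position,
               CASE WHEN position = 'closed'
                     AND LAG(position) OVER (ORDER BY sequence) = 'closed'
                    THEN 1 ELSE 0 END AS close_mark
        FROM t
    ),
    cte2 AS (
        -- shift the sequence so back-to-back 'closed' rows open a gap
        SELECT sequence, position,
               sequence + SUM(close_mark) OVER
                   (ORDER BY sequence ROWS UNBOUNDED PRECEDING) AS new_sequence
        FROM cte
    ),
    cte3 AS (
        SELECT cte2.*, ROW_NUMBER() OVER (ORDER BY new_sequence) AS rn
        FROM cte2
    ),
    cte4 AS (
        -- gaps-and-islands: (new_sequence - rn) is constant per group
        SELECT MIN(CASE WHEN position = 'closed' THEN sequence END) AS min_close,
               MAX(CASE WHEN position = 'closed' THEN sequence END) AS max_close
        FROM cte3
        GROUP BY new_sequence - rn
    )
    SELECT t.sequence, t.position
    FROM cte4
    JOIN t ON t.sequence BETWEEN cte4.min_close AND cte4.max_close
    WHERE cte4.max_close > cte4.min_close
    ORDER BY t.sequence
""").fetchall()

print([s for s, _ in rows])  # [2, 3, 4, 5, 18, 19, 20, 23, 24, 25, 26]
```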
Serkan Arslan
  • This doesn't seem to work on my real data. My data has much longer periods of closed and/or open that repeat, but the format is the same. What's going on? I'm talking about 1000 closed, then 1000 open, etc. – Olivia Oct 06 '17 at 08:59
  • Can you add some more data for testing? – Serkan Arslan Oct 06 '17 at 09:00
  • I'm sorry, I've noticed it's my data that's the issue. I'm creating sequence using `round(((round(convert(float , datetime),5)- 42961.58227)* 99999.97 + 1),1)` but noticed some duplicate/strange dates, so I'm just going to remove them and try again. Thanks for the quick response though – Olivia Oct 06 '17 at 09:30
  • The two numbers are the first date in the table and then 1/the difference between two datetimes (which differ by one second). That's just forcing `2017-08-16 13:58:27.837` into 1 and `2017-08-16 13:58:28.837` into 2, etc. – Olivia Oct 06 '17 at 09:36
  • It is an interesting solution for calculating the second difference. Why don't you just use the datediff function, like `datediff(second, start_date, date)+1`? – Serkan Arslan Oct 06 '17 at 10:51
  • It's been a long week haha – Olivia Oct 06 '17 at 11:14
  • Please send me duplicate sequence produced date samples. – Serkan Arslan Oct 06 '17 at 11:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/156094/discussion-between-olivia-and-sarslan). – Olivia Oct 06 '17 at 11:34