I have data where I am trying to identify patterns from. However the data in each table isn't complete (there are missing rows). I would like to separate the table into chunks of complete data and then identify the patterns from each. I have a column where I can use to identify if the data is complete or not called sequence
.
Data will look like:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
First I'd like to split the data into complete sections:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
---------------------------
8 closed
9 open
---------------------------
11 open
---------------------------
13 closed
14 open
15 open
---------------------------
18 closed
19 open
20 closed
Then I'd like to identify the pattern closed open, ..., open, closed
such that we go from closed to open for n rows (where n is at least 1) and then back to closed
From the sample data this would leave:
Sequence Position
2 closed
3 open
4 open
5 closed
---------------------------
18 closed
19 open
20 closed
This leaves my final table where I can perform analysis on as I know there are no broken sequences. I also have another column where the position
is binary if that is easier to work with.
The tables are large, so although I think I can write loops to figure out my result, I don't think that method would be efficient enough. Alternatively I was going to pull the whole table into R
, then find the result table but this requires pulling everything into R
first so I'm wondering if this is feasible in SQL
EDIT: Different sample data that's more representative:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
21 closed
22 closed
23 closed
24 open
25 open
26 closed
27 open
Note this should have the same results but also with
23 closed
24 open
25 open
26 closed
21
, 22
and 27
are not as they dont fit the closed
, open
...,open
, closed
pattern
BUT if we had 28 closed
we would want 27
and 28
as there is no time gap and the pattern would fit. If instead of 28
it was 29 closed
we wouldnt want 27
or 29
(because although the pattern is right the sequence breaks).
To add some context, think of a machine that goes from stop, to running, to stopped. We record the data, but have gaps in the recording which here are represented by the breaking of the sequences. As well as missing data in the middle of the stop running stop cycle; the data also sometimes starts recording when the machine is already running or stops recording before the machine stops. I don't want that data as it is not a complete cycle of stop, running, stop. I only want those complete cycles, and where the sequence was continuous. This means I can transform my original data set into one with only complete cycles one after the other.