SQL: selecting rows where column value changed from previous row

Question

Let's say I have this (MySQL) database, sorted by increasing timestamp:

Timestamp   System StatusA StatusB 
2011-01-01     A      Ok     Ok      
2011-01-02     B      Ok     Ok     
2011-01-03     A     Fail   Fail     
2011-01-04     B      Ok    Fail     
2011-01-05     A     Fail    Ok      
2011-01-06     A      Ok     Ok      
2011-01-07     B     Fail   Fail

How do I select the rows where StatusA changed from the previous row for that system? StatusB doesn't matter (I show it in this question only to illustrate that there may be many consecutive rows for each system where StatusA doesn't change). In the example above, the query should return the rows 2011-01-03 (StatusA changed between 2011-01-01 and 2011-01-03 for SystemA), 2011-01-06, 2011-01-07.

The query should execute quickly with the table having tens of thousands of records.

Thanks

@Andriy: 2011-01-04 is not returned because Status A is OK on both 2011-01-02 and 2011-01-04 (both for system B). — Jonathan Leffler, Jul 03 '11 at 00:21

ypercubeᵀᴹ · Accepted Answer · 2011-07-04T02:06:53.767

45

SELECT a.*
FROM tableX AS a
WHERE a.StatusA <>
      ( SELECT b.StatusA
        FROM tableX AS b
        WHERE a.System = b.System
          AND a.Timestamp > b.Timestamp
        ORDER BY b.Timestamp DESC
        LIMIT 1
      )

But you can try this as well (with an index on (System,Timestamp):

SELECT System, Timestamp, StatusA, StatusB
FROM
  ( SELECT (@statusPre <> statusA AND @systemPre=System) AS statusChanged
         , System, Timestamp, StatusA, StatusB
         , @statusPre := StatusA
         , @systemPre := System
    FROM tableX
       , (SELECT @statusPre:=NULL, @systemPre:=NULL) AS d
    ORDER BY System
           , Timestamp
  ) AS good
WHERE statusChanged ;

edited Jul 04 '11 at 02:06

answered Jul 03 '11 at 22:38

ypercubeᵀᴹ

113,259
19
174
235

The first query executes in about 13 seconds. There are a bit less than 5000 records in the database. – Jimmy Jul 04 '11 at 00:58
It can't be using indexes if 5K records aren't instantaneous. What do your indexes look like? – dkretz Jul 04 '11 at 01:16
2

There's an index on System, StatusA, TimeStamp... What would you like to know when you ask "what the indexes look like". They're purple with pink polka dots :) – Jimmy Jul 04 '11 at 01:25
@ypercube: I don't know what those @, := do -- so I have to figure that out before I try the 2nd one :) – Jimmy Jul 04 '11 at 01:27
Put an index on TimeStamp only. – dkretz Jul 04 '11 at 01:30
@dorfier: I removed the other indices, and it still takes 13 seconds. – Jimmy Jul 04 '11 at 01:43
@ypercube: hmm, mysql doesn't seem to like , (@statusPre:=NULL, @systemPre:=NULL) AS d – Jimmy Jul 04 '11 at 01:54
@Jimmy: What error do you get? I tested it and runs in about 0.16 seconds with more than 50K rows. – ypercubeᵀᴹ Jul 04 '11 at 01:58
@Jimmy: Oops sorry, something went wrong during copy-paste. It needs a SELECT there. – ypercubeᵀᴹ Jul 04 '11 at 02:06
2

Sweet! The second query executed in 0.0172 secs! – Jimmy Jul 04 '11 at 02:10
The first query worked great for me, after I added a b-tree index on the timestamp. Thanks! – you786 Sep 21 '12 at 17:23

score 12 · Answer 2 · edited Oct 20 '12 at 12:15

12

Use rownum

I've got 0.05 seconds on 20000 rows

select a1.*
  from (select rownum R_NUM, TIMESTAMP, System, StatusA from TableX) a1 
  join (select rownum R_NUM, TIMESTAMP, SYSTEM, STATUSA from TABLEX) a2 
    on a1.R_NUM = a2.R_NUM+1 
 where a1.system = a2.system 
   and a1.StatusA != a2.StatusA

edited Oct 20 '12 at 12:15

Ben

51,770
36
127
149

answered Jun 01 '12 at 15:12

Egor

121
1
3

1

The question is tagged `mysql` and rownum is only applicable to Oracle Dbs. Is there a mysql equivalent to this? – Patrick Jul 14 '17 at 06:55
1

`a1.R_NUM = a2.R_NUM+1 ` sounds like it would fail to capture the last row, and the case where there is exactly one row. – Gili Jan 31 '21 at 16:27

Jiri · Answer 3 · 2011-07-03T07:14:41.313

11

select a.Timestamp, a.System, a.StatusA, a.StatusB
from tableX as a
cross join tableX as b
where a.System = b.System
and a.Timestamp > b.Timestamp
and not exists (select * 
    from tableX as c
    where a.System = c.System
    and a.Timestamp > c.Timestamp
    and c.Timestamp > b.Timestamp
)
and a.StatusA <> b.StatusA;

Update addressing a comment: Why not use an inner join instead of a cross join?

The question asks for a MySQL solution. According to the documentation:

In MySQL, CROSS JOIN is a syntactic equivalent to INNER JOIN (they can replace each other). In standard SQL, they are not equivalent. INNER JOIN is used with an ON clause, CROSS JOIN is used otherwise.

This means that either of these joins would work.

The conditional_expr used with ON is any conditional expression of the form that can be used in a WHERE clause. Generally, you should use the ON clause for conditions that specify how to join tables, and the WHERE clause to restrict which rows you want in the result set.

The condition a.System = b.System probably falls under the 'how to join tables' category so using an INNER JOIN would be nicer in this case.

Since both produce the same results, the difference might be in performance. To say which will be faster I would need to know how are the joins implemented internally - whether they use indexes or hashing to do the joining.

edited Jul 03 '11 at 07:14

answered Jul 02 '11 at 23:33

Jiri

2,206
1
22
19

2

`A cross join B where A.x = B.x`? Why not `A inner join B on A.x = B.x`? Otherwise, it's nice indeed! (+1) – Andriy M Jul 03 '11 at 00:31
1

I'm sure MySQL is smart enough to notice the similarity between `CROSS JOIN + WHERE join_condition` and `INNER JOIN … ON join_condition`. INNER JOIN just seemed more natural to me, thence my question. Otherwise I'm of the opinion that your original version is as correct and as efficient as the revised one. Thanks for taking the trouble of adding this extended explanation to your answer! – Andriy M Jul 03 '11 at 08:33
@Andriy: I think that syntactic equivalence means that `CROSS JOIN ... WHERE join_condition` will produce same plan as `JOIN ... WHERE join_condition`. And `CROSS JOIN ... ON join_condition` (not standard SQL) the same as `JOIN ... ON join_condition`. I agree that it will be smart enough to have all 4 versions producing same plan. – ypercubeᵀᴹ Jul 03 '11 at 09:03
Hmm, I can't get this query to complete either; same result as le dorfier's suggestion. Entering it in phpmyadmin just results in a long wait, and eventually phpmyadmin returns to home screen. This is on a Host Gator account, which is usually quite quick. – Jimmy Jul 03 '11 at 21:23
@Jimmy: try the queries first with tables having small number of rows, not more than 20. And you should add a `(System, Timestamp)` index for big tables. – ypercubeᵀᴹ Jul 03 '11 at 22:04
1

I don't know why this answer is getting so many upvotes -- ypercube's answer (with the @ variables) is by far the fastest, at least according to my tests. – Jimmy Apr 26 '12 at 13:00

score 3 · Answer 4 · answered Jul 18 '22 at 08:00

This is exactly what windowing functions as built for. LAG will give the exact answer:

create table t1 (ts date, sys char(1),stata varchar(10),statb varchar(10));

insert into t1 values
('2011-01-01','A',' Ok','Ok'),
('2011-01-02','B',' Ok','Ok'),
('2011-01-03','A','Fail','Fail'),
('2011-01-04','B',' Ok','Fail'),
('2011-01-05','A','Fail','Ok'),
('2011-01-06','A',' Ok','Ok'),
('2011-01-07','B','Fail','Fail');

select * from (
select ts,sys,stata,lag(stata) over(partition by sys order by ts asc) as prev from t1
) as subsel where stata!=prev

score 2 · Answer 5 · answered Jan 22 '13 at 09:17

2

Egor's answer worked for me in MSSQL with one small change. Had to replace the ROWNUM statements with:

select row_number () over (order by TIMESTAMP) as R_NUM, ...

answered Jan 22 '13 at 09:17

Allen

434
3
13

score 2 · Answer 6 · answered May 17 '19 at 15:07

SELECT   a.*
FROM    (select row_number() over (partition by System order by Timestamp asc) as aRow, Timestamp, System, StatusA, StatusB from tableX) as a
left join (select row_number() over (partition by System order by Timestamp asc) as bRow, Timestamp, System, StatusA, StatusB from tableX) as b on a.aRow = b.bRow + 1 and a.System = b.System 
where (a.StatusA != b.StatusA or b.StatusA is null)

It will return first row and rows where value is different.

dkretz · Answer 7 · 2011-07-03T17:45:34.417

2

Here's a slightly shorter version with similar logic. I've tested this so often I'm sure it's efficient; primarily because it eliminates the correlated subquery (WHERE NOT EXISIS).

"c" is in there to make sure that b is directly below a - it says c (between them) can't be found (via the NULL test).

SELECT a.Timestamp, a.System, a.StatusA, a.StatusB
FROM tableX AS a
JOIN tableX AS b
    ON a.System = b.System
    AND a.Timestamp > b.Timestamp
LEFT JOIN tableX AS c
    ON a.System = b.System
    AND a.Timestamp > c.Timestamp
    AND b.Timestamp < c.Timestamp
WHERE c.System IS NULL
    AND a.StatusA <> b.StatusA;

edited Jul 03 '11 at 17:45

answered Jul 03 '11 at 08:30

dkretz

37,399
13
80
138

2

dorfier: Don't you mean: `LEFT JOIN tableX AS c ON a.System = c.System AND a.Timestamp > c.Timestamp AND c.Timestamp > b.Timestamp` ? – ypercubeᵀᴹ Jul 03 '11 at 08:59
Hmm, I can't seem to get this query to complete -- entering it in phpmyadmin just results in a long wait, and eventually phpmyadmin returns to home screen. I modified the query as ypercube suggested. – Jimmy Jul 03 '11 at 21:16
@le dorfier: I like this approach: What indexes does it need to have good performance? I tried various in my machine without good results. When the rows get 2x, time goes about 4x (the same happens with my first query and with Jiri's). – ypercubeᵀᴹ Jul 04 '11 at 08:34
@le dorfier: Another thought. Have you used it in MySQL before or only with other systems? Perhaps MySQL optimizer is not clever enough to find a good plan for this. – ypercubeᵀᴹ Jul 04 '11 at 08:36
1

Nope, i'm referring to MySQL. If you have an index on timestamp only, it's going to index directly to the record with almost no chance of duplicate values. Highest possible cardinality - you don't need any other fields in the index at all. – dkretz Jul 04 '11 at 18:34

Richard-code-gig · Answer 8 · 2023-02-23T05:30:49.513

0

Select * from table
Qualify lag(StatusA) is distinct from StatusA over (Partition by System order by Timestamp)
;

Since mysql don't have Qualify and is distinct from functions:

SELECT
    Timestamp, System, StatusA, StatusB 
FROM (
    SELECT
        *, lag(StatusA) OVER (Partition by System order by Timestamp) as prev 
    FROM
         table
) a
WHERE
    a.prev != StatusA AND a.prev is null
;

edited Feb 23 '23 at 05:30

answered Feb 16 '23 at 22:58

Richard-code-gig

36
5

SQL: selecting rows where column value changed from previous row

8 Answers8

Linked