4

Given the following table:

CREATE TABLE table
(
 "id" serial NOT NULL,
 "timestamp" timestamp without time zone NOT NULL,
 "count" integer NOT NULL DEFAULT 0
)

I am searching for "rare events". A rare event is a row that owns the following properties:

  • Simple: count = 1
  • Hard: All rows within a 10 minute timespan (before and after the current row's timestamp) have count = 0 (except the given row, of course).

Example:

id   timestamp  count
0    08:00      0    
1    08:11      0    
2    08:15      2     <== not rare event (count!=1)   
3    08:19      0    
4    08:24      0    
5    08:25      0   
6    08:29      1     <== not rare event (see 8:35)
7    08:31      0    
8    08:35      1    
9    08:40      0    
10   08:46      1     <== rare event!  
10   08:48      0   
10   08:51      0   
10   08:55      0   
10   08:58      1     <== rare event!  
10   09:02      0   
10   09:09      1

Right now, I have the following PL/pgSQL-function:

SELECT curr.* 
    FROM gm_inductionloopdata curr
    WHERE curr.count = 1
    AND (
      SELECT SUM(count)
      FROM gm_inductionloopdata
      WHERE timestamp BETWEEN curr.timestamp + '10 minutes'::INTERVAL
      AND curr.timestamp - '10 minutes'::INTERVAL
    )<2

which is dead slow. :-(

Any suggestions on how to improve performance? I am working on > 1 mio rows here and might need to find those "rare events" on a regular basis.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Ronk
  • 53
  • 3
  • 1
    Your query is wrong, there is no WHERE and the BETWEEN is flipped (should be `- '10 min...' AND ... + '10 min...')... – MatheusOl Sep 03 '13 at 17:07
  • Also, have you tried an index on timestamp column to see if your current query works well? – MatheusOl Sep 03 '13 at 17:08
  • 1
    Are there really seven rows with the same ID? Does "timestamp" really mean "time"? (In SQL databases, *timestamp* usually means date and time, not just time, which is a different data type altogether in PostgreSQL.) – Mike Sherrill 'Cat Recall' Sep 03 '13 at 23:32

2 Answers2

3

I think this is a good case to use lead and lag window functions - this query filters all records with count = 1 and then get previous and next row to see if it closer than 10 minutes:

with cte as (
  select
      "id", "timestamp", "count",
      lag("timestamp") over(w) + '10 minutes'::interval as "lag_timestamp",
      lead("timestamp") over(w) - '10 minutes'::interval as "lead_timestamp"
  from gm_inductionloopdata as curr
  where curr."count" <> 0
  window w as (order by "timestamp")
)
select "id", "timestamp"
from cte
where
    "count" = 1 and
    ("lag_timestamp" is null or "lag_timestamp" < "timestamp") and
    ("lead_timestamp" is null or "lead_timestamp" > "timestamp")

sql fiddle demo

Or you can try this, and ensure that you have index on timestamp column of your table:

select *
from gm_inductionloopdata as curr
where
    curr."count" = 1 and
    not exists (
        select *
        from gm_inductionloopdata as g
        where 
           -- you can change this to between, I've used this just for readability
           g."timestamp" <= curr."timestamp" + '10 minutes'::interval and
           g."timestamp" >= curr."timestamp" - '10 minutes'::interval and
           g."id" <> curr."id" and
           g."count" = 1
    );

sql fiddle demo

BTW, please don't call your columns "count", "timestamp" or other keywords, function names and type names.

Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
  • Your first example is a nice use of lead and lag, but does not solve the problem. As the problem states, it can't have rows near 10 minutes with count<>0, it doesn't mean the rows is near by one row (as lead and lag assumes). – MatheusOl Sep 03 '13 at 17:10
  • @MatheusOl My first query filters all rows with count = 1 and then check if closest one is farther than 10 minutes, so it should do the trick – Roman Pekar Sep 03 '13 at 17:16
  • yeah, but it will not work if there is a row farther than 10 minutes and also farther from 2 or more rows, as the case of 08:29 (from question's example), the lead of it is 08:31, but 08:35 is also within 10 minutes and you are not considering this one. – MatheusOl Sep 03 '13 at 17:18
  • @MatheusOl have you check the SQL fiddle? There's 2 cte, I get lag and lead from ALREADY FILTERED list – Roman Pekar Sep 03 '13 at 17:23
  • ok, I got the "already filtered" now... But there is still a small problem, if the nearby count is not **exactly** 1. But it is easy to solve: http://sqlfiddle.com/#!12/cea93/1/0 (notice I changed the line of 08:35). – MatheusOl Sep 03 '13 at 19:21
  • nice, +1. You don't need 2 CTE's , You can just stick `"count" <> 0` into the second one. – maniek Sep 03 '13 at 22:18
2

This can be faster, yet (improving the 1st solution of @Roman).

SELECT id, ts, ct
FROM  (
    SELECT id, ts, ct
        ,lag (ts, 1, '-infinity') OVER (ORDER BY ts) as prev_ts
        ,lead(ts, 1,  'infinity') OVER (ORDER BY ts) as next_ts
    FROM   tbl
    WHERE  ct <> 0
    ) sub
WHERE  ct = 1
AND    prev_ts < ts - interval '10 min'
AND    next_ts > ts + interval '10 min'
ORDER  BY ts;
  • Handling of the corner cases of "no leading / lagging row" can be much simplified with the following two pieces of information:

  • Subqueries are generally more efficient than CTEs (some exceptions apply), since CTEs introduce optimization barriers (by design and on purpose). If performance matters, use CTEs only when you need them.

Also:

  • I use proper column names instead of timestamp and count, thereby removing the need to double-quote identifiers. Never use reserved words or basic type or function names as identifiers.

  • None of this has anything to do with , which is the default procedural language of Postgres.

SQL Fiddle.

Index

Since we are dealing with a big table (> 1 mio rows) and are only interested in "rare events" the important thing for performance is a partial index like the following:

CREATE INDEX tbl_rare_idx ON tbl(ts) WHERE ct <> 0;

If you are on Postgres 9.2 or later and given some preconditions, make that a covering index for index-only scans.

CREATE INDEX tbl_rare_covering_idx ON tbl(ts, ct, id)
WHERE ct <> 0;

Test with EXPLAIN ANALYZE to see which query is faster and whether the index is used.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228