Select finishes where athlete didn't finish first for the past 3 events

Question

Suppose I have a database of athletic meeting results with a schema as follows

DATE,NAME,FINISH_POS

I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data

2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1

The following rows:

2013-06-20,Johnson,4
2013-06-19,Johnson,2

Would be matched. I have only managed to get started at the following stub:

select date,name FROM table WHERE ...;

I've been trying to wrap my head around the where clause but I can't even get a start

Won on 2013-06-15 therefore ignore next three entries. Then grab 2013-16-19, 2013-06-20 and ignore 2013-06-22 since it is only one event after last victory — deltanovember, Jun 22 '13 at 05:50
Just a quick tip: "date" is a terrible column name; it's a type name, so you'll have to "double quote" it in some places to avoid syntax errors. Avoid using type names or keywords as column names. — Craig Ringer, Jun 22 '13 at 10:40
Please always include your PostgreSQL version in questions. In this case, both Andomar and I have used features only available since 8.4 (possibly newer). — Craig Ringer, Jun 22 '13 at 11:13

score 4 · Accepted Answer · edited May 23 '17 at 11:50

I think this can be even simpler / faster:

SELECT day, place, athlete
FROM  (
   SELECT *, min(place) OVER (PARTITION BY athlete
                              ORDER BY day
                              ROWS 3 PRECEDING) AS best
   FROM   t
   ) sub
WHERE  best > 1

->SQLfiddle

Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level since window functions are applied after the WHERE clause. So you need at least one CTE of sub-select for a condition on the result of a window function.

Details about window function calls in the manual here. In particular:

If frame_end is omitted it defaults to CURRENT ROW.

If place (finishing_pos) can be NULL, use this instead:

WHERE  best IS DISTINCT FROM 1

min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.

Don't use type names and reserved words as identifiers, I substituted day for your date.

This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.

@Craig already mentioned the index to make this fast.

+1 kind of funny you find the best performance with the `min` aggregate :D — Andomar, Jun 22 '13 at 21:39
Very good point regarding use of `min`; that's considerably better than using `array_agg` and a contains test. — Craig Ringer, Jun 23 '13 at 06:42

Craig Ringer · Answer 2 · 2013-06-22T11:08:24.203

Here's an alternative formulation that does the work in two scans without subqueries:

SELECT
  "date", athlete, place
FROM (
  SELECT 
    "date",
    place,
    athlete,
    1 <> ALL (array_agg(place) OVER w) AS include_row
  FROM Table1
  WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;

See: http://sqlfiddle.com/#!1/fa3a4/34

The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.

Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.

I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.

Andomar · Answer 3 · 2013-06-22T06:02:18.963

1

; with  CTE as
        (
        select  row_number() over (partition by athlete order by date) rn
        ,       *
        from    Table1
        )
select  *
from    CTE cur
where   not exists
        (
        select  *
        from    CTE prev
        where   prev.place = 1
                and prev.athlete = cur.athlete
                and prev.rn between cur.rn - 3 and cur.rn
        )

Live example at SQL Fiddle.

edited Jun 22 '13 at 06:02

answered Jun 22 '13 at 05:54

Andomar

232,371
49
380
404

Interesting approach. I'm pretty impressed by that fact that Pg executes it with a hash anti-join rather than a nested loop, too. – Craig Ringer Jun 22 '13 at 11:04

Select finishes where athlete didn't finish first for the past 3 events

3 Answers3

Linked