1

Task: select sportsmen who participate in at least 2 competitions in a row (2 competitions go one after another; 1-2-3-4-5: 2&4 or 1&3&5 are not ok, 1&2 is ok, 1&2&3 is ok, 1&2 and 4&5 is ok). Question: find the best way (faster, less resources)

Working table:

enter image description here

There is single hold_date for each competition_id.

There is only one result for each sportsman_id per competition_id.

This works fine for 25 rows in result table:

SELECT DISTINCT sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country
FROM
    (
    SELECT sportsman_id, hold_date,
        LAG (comp_order, 1) OVER (PARTITION BY sportsman_id ORDER BY sportsman_id) prev_comp_number
        , comp_order
    FROM result
    INNER JOIN
        (
        SELECT hold_date, ROW_NUMBER() OVER (ORDER BY hold_date) AS comp_order
        FROM
            (
            SELECT DISTINCT hold_date
            FROM result
            )
        ) USING (hold_date)
    ORDER BY sportsman_id, comp_order
    )
INNER JOIN sportsman USING (sportsman_id)
WHERE comp_order-prev_comp_number=1
;

screenshot of code with comments:

screenshot of code with comments

sample data:

sample data

result of code above (=desired result)

result of code above (=desired result)

Let's assume there are millions of rows (thousands of competitions and thousands of sportsmen). How reliable my code is?

What I think is decreasing the number of rows by excluding the rows if sportsman_id occurs just once (if sportsman took part (got result) at only 1 competition he obviously can't be the one). Something like this: (haven't implement tho (dunno how or most likely when/where))

SELECT re.hold_date, r.sportsman_id
FROM result r
INNER JOIN result re ON (re.sportsman_id=r.sportsman_id)
GROUP BY r.sportsman_id, re.hold_date
HAVING COUNT(r.sportsman_id) > 1
;

Then, I guess with LAG I only double existing column which is kinda fine?

Is there easier way with using PLSQL? Or there is a function which does some part of my code?

GMB
  • 216,147
  • 25
  • 84
  • 135
void_eater
  • 113
  • 1
  • 3
  • 9
  • 1
    Sample data and desired results would be a big help. – Gordon Linoff Oct 25 '20 at 12:37
  • added as screenshots – void_eater Oct 25 '20 at 12:48
  • Your data model shows that one competition can span several days (hence the date in the results table, which otherwise would be in the competitions table). Does this also mean that two competitions can overlap? Could I find one competition on September 5 and 6 and another on September 4 and 7? If so, how to go about this? – Thorsten Kettner Oct 25 '20 at 13:25
  • 1
    Please [edit] your question to include the sample data, commented code and desired results as text (and for the code, preferably as DDL/DML statements we can copy/paste). – MT0 Oct 25 '20 at 13:42
  • @ThorstenKettner As stated: There is single hold_date for each competition_id. No overlapping; so actually hold_date is unique. – void_eater Oct 25 '20 at 14:10
  • Okay. You should fix your data model then. – Thorsten Kettner Oct 25 '20 at 14:21

4 Answers4

2

You can do it by only reading the table once using the Tabibitosan method to group sequential competitions together https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/#:%7E:text=The%20SQL%20of%20Gaps%20and%20Islands%20in%20Sequences,...%204%20Performance%20Comparison%20of%20Gaps%20Solutions.%20

Here you would have to use add_months because your competitions are months apart:

select sportsman_id, min(hold_date) , max(hold_date), comps_in_island
from (
 select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
 from (
  select  competition_id, sportsman_id, hold_date , add_months(hold_date,-1*row_number() over(partition by sportsman_id order by hold_date)) island
  from    result
 )
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;

DB fiddle: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=1b707262722bc555ad851aee029b347a

-edit I got confused by some of the data, it looks like it's not the date that's important but the competition_id. This makes it simpler, if you have a gapless competition_id sequence (so competition 65786162213 was 65.7 billion events after 4)

select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
from (
 select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
 from 
  select  competition_id, sportsman_id, hold_date , competition_id -row_number() over(partition by sportsman_id order by competition_id)) island
  from    result
 )
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;

Or if you need to work out the competition numbers first you just need an additional subquery using dense_rank to rank the unique competition_ids accounting for ties :

select sportsman_id, min(competition_id) , max(competition_id), comps_in_island
from (
 select  competition_id, sportsman_id, hold_date, island, count(*) over (partition by sportsman_id,island) comps_in_island
 from (
  select  competition_id, sportsman_id, hold_date , comp_number -row_number() over(partition by sportsman_id order by comp_number) island
  from (  
   select  competition_id, sportsman_id, hold_date , dense_rank() over (partition by null order by competition_id) comp_number
   from    result
  )
 )
)
where comps_in_island > 1
group by sportsman_id, island, comps_in_island;

This does assume that every possible competion_id you care about has a row in result.

Andrew Sayer
  • 2,296
  • 1
  • 7
  • 9
  • Does this rely on competitions being exactly 1 month apart? You appear to have copied my sample data (without citation) and based the answer on that. While it works for my data, the OP's data does not have this property. – MT0 Oct 25 '20 at 13:35
  • Ah, yes I was assuming you had used the same data as the OP provided, maybe not. Not a hard change to make though, hold on... – Andrew Sayer Oct 25 '20 at 13:37
  • Last part looks smooth, it works and you seem to use it based on performance too. Thank you. Will read article later. – void_eater Oct 25 '20 at 15:33
  • @AndrewSayer Tho I don't yet understand how this /count(*) over (partition by sportsman_id,island) comps_in_island/ worked out. upd: seems like I get it. And what are the weak parts of my code? – void_eater Oct 25 '20 at 15:44
  • 1
    count(*) over (partition by sportsman_id,island) gives you the number of rows that matched on sportsman_id and island. You can run the query at different subquery levels to look at what's going on. Your code is not necessarily weak, but it requires joining your large table with itself, you can achieve the same optimization I've done to limit that by using the dense_rank analytical function. The lag to check that the previous competition number for that sportsman_id is the global previous one works fine but you might find that expanding to get other information for the run is difficult. – Andrew Sayer Oct 25 '20 at 16:06
  • @AndrewSayer Yeah, in other words I would say it count combinations between sportsman_id and island values. And if there are more than 1 it means these competitions are consistent (what we know after dense_rank) for each sportsman. Ty – void_eater Oct 25 '20 at 16:19
  • So as far as I can see using PLSQL isn't giving any point or advantage to solve it? – void_eater Oct 25 '20 at 20:50
  • I can't see any solution here given with PL/SQL and that's because it's generally going to be slower (you would end up doing things row-by-row which won't scale as well as pure SQL. Of course, there are exceptions but this isn't one. Analytic functions give you so much power then any additional processing you would do with PL/SQL is a bit redundant. – Andrew Sayer Oct 25 '20 at 20:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/223602/discussion-between-void-eater-and-andrew-sayer). – void_eater Oct 25 '20 at 22:45
1

If you perform a partitioned outer join on your results with a full list of competitions then you will have NULL rows when a competitor did not attend a competition. Then you can use MATCH_RECOGNIZE to compare rows sequentially and COUNT the number of sequential competitions they attended and eliminate the sportsmen who only attended a competition but did not attend the competitions before and after.

SELECT sportsman_id
FROM   (
  SELECT sportsman_id,
         c.competition_id,
         c.hold_date,
         NVL2( r.competition_id, 1, 0 ) AS attended
  FROM   ( SELECT DISTINCT
                  competition_id,
                  hold_date
           FROM   result
         ) c
         LEFT OUTER JOIN result r
         PARTITION BY ( r.sportsman_id )
         ON ( c.competition_id = r.competition_id )
)
MATCH_RECOGNIZE (
  PARTITION BY sportsman_id
  ORDER BY hold_date
  MEASURES COUNT(*) AS num_sequential
  ONE ROW PER MATCH
  PATTERN ( ATTENDED_COMP+ )
  DEFINE
    ATTENDED_COMP AS (
      ATTENDED_COMP.attended = 1
    )
)
GROUP BY sportsman_id
HAVING MIN( num_sequential ) > 1;

So, for the sample data:

CREATE TABLE result ( competition_id, sportsman_id, hold_date ) AS
SELECT 1, 1, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 2, 1, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 3, 1, DATE '2020-03-01' FROM DUAL UNION ALL
SELECT 4, 1, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 5, 1, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 2, 2, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 4, 2, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 5, 2, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 2, 3, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 4, 3, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 1, 4, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 3, 4, DATE '2020-03-01' FROM DUAL UNION ALL
SELECT 5, 4, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 1, 5, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 2, 5, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 5, 5, DATE '2020-05-01' FROM DUAL;

The output is:

| SPORTSMAN_ID |
| -----------: |
|            1 |
|            2 |

db<>fiddle here


If you want sportsmen who have attended any set of sequential competitions (regardless of whether not all their competitions are included in a sequential group) then you can change the final line to:

HAVING MAX( num_sequential ) > 1;

and the output would be:

| SPORTSMAN_ID |
| -----------: |
|            1 |
|            2 |
|            5 |

db<>fiddle here


Or, if you want details of the ranges that are matched you can use PATTERN ( ATTENDED_COMP{2,} ) to match only those sequential groups where a competitor attends two or more competitions in a row:

SELECT *
FROM   (
  SELECT sportsman_id,
         c.competition_id,
         c.hold_date,
         NVL2( r.competition_id, 1, 0 ) AS attended
  FROM   ( SELECT DISTINCT
                  competition_id,
                  hold_date
           FROM   result
         ) c
         LEFT OUTER JOIN result r
         PARTITION BY ( r.sportsman_id )
         ON ( c.competition_id = r.competition_id )
)
MATCH_RECOGNIZE (
  PARTITION BY sportsman_id
  ORDER BY hold_date
  MEASURES
    FIRST( competition_id ) AS first_competition_id,
    FIRST( hold_date ) AS first_hold_date,
    LAST( competition_id ) AS last_competition_id,
    LAST( hold_date ) AS last_hold_date
  ONE ROW PER MATCH
  PATTERN ( ATTENDED_COMP{2,} )
  DEFINE
    ATTENDED_COMP AS ( ATTENDED_COMP.attended = 1 )
)

Outputs:

SPORTSMAN_ID | FIRST_COMPETITION_ID | FIRST_HOLD_DATE     | LAST_COMPETITION_ID | LAST_HOLD_DATE     
-----------: | -------------------: | :------------------ | ------------------: | :------------------
           1 |                    1 | 2020-01-01 00:00:00 |                   5 | 2020-05-01 00:00:00
           2 |                    1 | 2020-01-01 00:00:00 |                   2 | 2020-02-01 00:00:00
           2 |                    4 | 2020-04-01 00:00:00 |                   5 | 2020-05-01 00:00:00
           5 |                    1 | 2020-01-01 00:00:00 |                   2 | 2020-02-01 00:00:00

db<>fiddle here

MT0
  • 143,790
  • 11
  • 59
  • 117
  • Shouldn't sportsman_id 5 count for that data, it exists in competition 1 and 2. – Andrew Sayer Oct 25 '20 at 13:49
  • 1
    @AndrewSayer The OP isn't clear on that; Sportsman `5` has attended 1, 2 & 5 so they both meet and don't meet the criteria. I erred for the side that if they have any competition where they did not attend the preceding or succeeding competition then they were excluded from the results. However, if they should be included then it is a simple matter to changing `MIN` to `MAX` in the final line. – MT0 Oct 25 '20 at 13:53
0

If you just want a list of sportmen who participated two consecutive competitions at least once, then using lag() juste once is sufficient:

select distinct sportman_id
from (
    select sportman_id, competition_id
        lag(competition_id) over(partition by sportman_id, oder by competition_id) lag_competition_id
    from result r
) r
where competition_id = lag_competition_id + 1

You can bring the corresponding sportsman row with exists:

select s.*
from sportman s
where exists (
    select 1
    from (
        select sportman_id, competition_id
            lag(competition_id) over(partition by sportman_id, oder by competition_id) lag_competition_id
        from result r
    ) r
    where r.competition_id = r.lag_competition_id + 1 and r.sportman_id = s.sportman_id
)
GMB
  • 216,147
  • 25
  • 84
  • 135
  • @void_eater: what problem are you facing with the query? – GMB Oct 25 '20 at 16:07
  • over(partition by sportman_id, oder by competition_id) - ",", oRder, sportSman_id . Then /ORA-00904: "COMPETITION_ID": invalid identifier/ – void_eater Oct 25 '20 at 16:11
  • @void_eater: my bad. We need to select that column in the subquery first, so we can use it in the outer query. Fixed. – GMB Oct 25 '20 at 16:12
0

You say that there is always only one date per competition. This date should hence reside in the competitions table, not in the results table. You also say that the dates don't overlap (no two competitions on the same date - which would also be possible to ensure with a constraint, were the date in the competition table).

In the first step get the competitions / dates in order. With your data model:

select distinct hold_date
from result
order by hold_date;

To get this result quickly, provide an index on the date:

create index idx1 on result (hold_date);

You can even number these with ROW_NUMBER or use LAG or LEAD to see a date with its neighbouring date(s).

Now, the optimal method to look for sportsmen who participated in two consecutive events depends a lot from how often sportmen participate in general.

  1. If they rarely participate, say, usually just two times, we can just join and quickly look at the results.
  2. If they participate a lot, say, usually in about half of the events, we want to iterate through the events and stop once we find consecutive events, rather than read on and on.

Here is a query for the second approach. We use a recursive query (because this is how we apply iterate processes in SQL). We start with all sportsmen and the first date. Then we go to the second date and stop for all who participated in both. For the rest we look at the third date and again stop for those who participated in the second and third. And so on.

There should be an index on date and sportsman to look up a result row quickly. I'd even provide two indexes, because I don't know which column is more selective. So, let the DBMS decide.

create index idx2 on result (hold_date, sportsman_id);
create index idx3 on result (sportsman_id, hold_date);

And here is the query:

with dates as 
(
  select
    hold_date,
    lead(hold_date) over (order by hold_date) as next_date,
    min(hold_date) over (order by hold_date) as min_date
  from (select distinct hold_date from result)
)
, cte (sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country,
       hold_date, next_date, was_in, is_in) as
(
  select
    s.sportsman_id, s.sportsman_name, s.rank, s.year_of_birth,
    s.personal_record, s.country, d.hold_date, d.next_date, 'NO',
    case when r.hold_date is not null then 'YES' else 'NO' end
  from sportsman s
  cross join (select * from dates where hold_date = min_date) d
  left join result r on r.sportsman_id = s.sportsman_id
                     and r.hold_date = d.hold_date
  union all
  select
    s.sportsman_id, s.sportsman_name, s.rank, s.year_of_birth,
    s.personal_record, s.country, d.hold_date, d.next_date, s.is_in,
    case when r.hold_date is not null then 'YES' else 'NO' end
  from cte s
  join dates d on d.hold_date = s.next_date
  left join result r on r.sportsman_id = s.sportsman_id
                     and r.hold_date = d.hold_date
  where not (s.was_in = 'YES' and s.is_in = 'YES')
)
select sportsman_id, sportsman_name, rank, year_of_birth, personal_record, country
from cte
where was_in = 'YES' and is_in = 'YES';
Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
  • Thanks for debriefing as I'm learning. But it's a test table/relations and I guess its creator didn't care much about it. But honestly, the answer from Andrew Sayer and "Island" concept seems so much easier to understand and also most likely more efficient (can't be sure). – void_eater Oct 25 '20 at 22:44
  • Yes, it probably is. As mentioned mine is only appropriate when you expect the sportsmen to participate in many of the events. If almost all sportsmen participate in almost all events each and you only want to filter out the very few who didn't participate in any consecutive events, then it's a good approach to only look up the few events for each sportsman until you find the first consecutive event and not go through all the data. – Thorsten Kettner Oct 25 '20 at 22:51