0

I am trying to write a PostgreSQL query to calculate fantasy baseball points, as described here. What I have so far is available in this SQLfiddle, which calculates the points for each statistic correctly, except in the case of ties. Points for teams that are tied are supposed to be calculated like so:

In the case of a tie, each team involved receives an average of the total points due — i.e., in the above example, if two teams were tied for first in a category, each would receive 9.5 points [(10 + 9) / 2 = 9.5].

You can see the errors in my approach in the first and second result sets in my SQLfiddle. In the first result set, the teams tied with 9 home runs should each get 3.5 points (sum of ranks 4 and 3 == 7, divided by 2) while in the second set, the teams tied with 33 RBI should also each get 3.5 (sum of ranks 5, 4, 3, and 2 == 14, divided by 4).

What would be the simplest way to correct these errors, and distribute the total points evenly among teams tied in the rankings of each statistic?

Community
  • 1
  • 1
tonycpsu
  • 407
  • 5
  • 13
  • Can you explain what is `hr` and what is `rbi` in your schema? Also, if you're aiming and floating scores (like `9.5`), why do you use `integer` and not `numeric` type? – vyegorov May 22 '14 at 17:54
  • HR and RBI are the values I want to rank over -- in this case, [home runs](http://en.wikipedia.org/wiki/Home_run) and [runs batted in](http://en.wikipedia.org/wiki/Run_batted_in) for baseball teams. These are integers, but in the case of a tie, the rankings of the teams should be averaged together, as in the Yahoo link in my question. – tonycpsu May 22 '14 at 18:06

2 Answers2

1

A brute force method might be to calculate off the unadjusted rank, something like this:

select hr, sum(raw) / count(*)
  from (
      select hr,
             (select count(*) + 1 from stats) - row_number() over (order by hr desc) as raw
        from stats
    ) r
 group by hr
 order by hr desc

and join that with the stats table to get the points for a given score.

John Bickers
  • 481
  • 2
  • 6
  • That's a good start, but it'll get quite unwieldy repeating that subquery (with joins) for each of the ten statistics. Hopefully there's an easier way. – tonycpsu May 17 '14 at 21:03
  • Unwieldy is not uncommon with SQL. Maybe it can be turned into a function somehow, where the stats column is a parameter. – John Bickers May 17 '14 at 21:19
  • Postgres is smart enough to avoid doing repetitive scans for the inner sub-query. You can confirm this by looking at the `EXPLAIN` output. – vyegorov May 23 '14 at 07:08
1

In my view, provided example is not a complete one, it'd be nice to have a candidate key among the provided fields.

  1. It is possible to use any aggregate function as window one. This removes the necessity for the sub-query to count all rows in the table.

Consider the output of the following query:

SELECT 
    hr,rbi,
    rank() OVER h AS hr_rank,
    row_number() OVER h AS hr_rn,
    count(*) OVER () - rank() OVER h + 1 AS hr_aprx,
    rank() OVER r AS rbi_rank,
    row_number() OVER r AS rbi_rn,
    count(*) OVER () - rank() OVER r + 1 AS rbi_aprx,
    count(*) OVER () AS cnt
FROM 
    stats
WINDOW h AS (ORDER BY hr DESC), r AS (ORDER BY rbi DESC);

This query provides the same information as your first 2 queries. If you'll look into the EXPLAIN (analyze, buffers) output for it, you'll see that table is being accessed only once.

I've named point columns as %_aprx here, for these are approximate points yet, we'll have to calculate the average.

  1. Now, as we have prepared some data for further calculations, we'll have to use a sub-query. This is due to the fact, that we must use our %_aprx columns for data grouping. I will use CTE here, as I find named sub-queries looking better.

Consider this query (also on sql-fiddle):

WITH ranks AS (
    SELECT 
        hr, rbi,
        rank() OVER h AS hr_rank,
        row_number() OVER h AS hr_rn,
        count(*) OVER () - rank() OVER h + 1 AS hr_aprx,
        rank() OVER r AS rbi_rank,
        row_number() OVER r AS rbi_rn,
        count(*) OVER () - rank() OVER r + 1 AS rbi_aprx,
        count(*) OVER () AS cnt
    FROM 
        stats
    WINDOW h AS (ORDER BY hr DESC), r AS (ORDER BY rbi DESC)
)
SELECT 
    hr, rbi,
    (avg(hr_rn) OVER h)::float AS hr_pts,
    (avg(rbi_rn) OVER r)::float AS rbi_pts,
    (avg(hr_rn) OVER h + avg(rbi_rn) OVER r)::float AS ttl_pts
FROM 
    ranks
WINDOW h AS (PARTITION BY hr_aprx), r AS (PARTITION BY rbi_aprx)
ORDER BY 
    ttl_pts DESC, hr_pts DESC;

I'm converting resulting type of avg() calls to float to get rid of the series of zeroes. You can choose to use round() function instead here though.

I've also added 2 ordering conditions, for ordering just by ttl_pts is not enough.

Note, that in the outer query's window definitions ORDER BY is missed out on purpose. With it, you'll get a running average effect (you can change query and see yourself).

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
vyegorov
  • 21,787
  • 7
  • 59
  • 73
  • Brilliant! Yes, my example was missing some details, but intentionally so. I stripped it down to a "minimal working example", because the "table" I'm running this query on is actually the result of another complex nested query that aggregate the team results from individual games, and I didn't want to make the problem look more complex than it is. With your solution I can either build a temporary table with the aggregate results periodically, or adapt it to run over the aggregate query. Thanks! – tonycpsu May 23 '14 at 14:45