1

I'm playing around with the Lahman Baseball Database in a MySQL instance. I want to find the players who topped home runs (HR) for each year. The Batting table has the following (relevant parts) of its schema:

+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| playerID  | varchar(9)           | NO   | PRI |         |       |
| yearID    | smallint(4) unsigned | NO   | PRI | 0       |       |
| HR        | smallint(3) unsigned | YES  |     | NULL    |       |
+-----------+----------------------+------+-----+---------+-------+

For each year, every player has an entry (between hundreds and 12k per year, going back to 1871). Getting the top N hitters for a single year is easy:

SELECT playerID,yearID,HR
FROM Batting
WHERE yearID=2009
ORDER BY HR DESC LIMIT 3;
+-----------+--------+------+
| playerID  | yearID | HR   |
+-----------+--------+------+
| pujolal01 |   2009 |   47 |
| fieldpr01 |   2009 |   46 |
| howarry01 |   2009 |   45 |
+-----------+--------+------+

But I'm interested in finding the top 3 from every year. I've found solutions like this, describing how to select the top from a category and I've tried to apply it to my problem, only to end up with a query that never returns:

SELECT
    b.yearID, b.playerID, b.HR
FROM
    Batting AS b
LEFT JOIN
    Batting b2
    ON
    (b.yearID=b2.yearID AND b.HR <= b2.HR)
GROUP BY b.yearID HAVING COUNT(*) <= 3;

Where have I gone wrong?

Community
  • 1
  • 1
Drew Stephens
  • 17,207
  • 15
  • 66
  • 82

2 Answers2

3

Something like this should work:

SELECT b.playerID, b.yearID, b.HR
FROM Batting b 
WHERE HR >= (
    SELECT b2.HR 
    FROM Batting b2 
    WHERE b2.yearID=b1.yearID
    ORDER BY b2.HR DESC
    LIMIT 2, 1
)
ORDER BY b.yearID DESC, b.HR DESC;

Explanation: select all rows which have >= number of home runs as the third highest for that year. This won't break ties. So if there's more than one batter with the same number of home runs, they'll all show up.

The results are the ordered from the most recent year, sub-ordered by rank for each year.

Note: LIMIT is a 0-based offset, so 2, 1 means starting after the second row grab one row, i.e.: the third row.

Ben S
  • 68,394
  • 30
  • 171
  • 212
  • Plus 1 for the LIMIT explanation. – Ryan Liang Jun 17 '10 at 17:10
  • The LIMIT arguments actually go the other way: offset, then number of rows. Also, you have an error in the subquery—`b1` should be just `b`. Other than that, this is correct. It took 4:18 to find *just* the results since 2005 (Macbook Pro, OS X 10.6.3, Core 2 2.5GHz, enough RAM to have all the data in memory), so perhaps there are some optimizations to be made. – Drew Stephens Jun 17 '10 at 17:29
  • Thanks for pointing out the limit details. I edited my answer. Is you query cache actually set to large enough to hold everything in memory though? The reason it's so slow is that row every row in Batting, it does the subquery which is quite expensive. Possible optimizations would be to add an index on yearID and adding another index on HR. A further optimization would be to make a temporary table that holds each year along with the third-highest HR count to compare against. – Ben S Jun 17 '10 at 17:43
  • Awesome, thanks for the tips! Indexing yearID & HR got the query done in under 2 seconds. – Drew Stephens Jun 18 '10 at 02:24
0

Wow, random. I happened to be doing the same query (for salaries though) on the Lahman Baseball DB using an article on emulating Oracle Analytic functions. This version of the query is snappy, but not so intuitive.

select *
from (

select 
    b.yearID as year,
    b.teamID as team,
    m.nameFirst as first,
    m.nameLast as last,
    find_in_set(b.HR, x.teamRank) as rank,
    b.HR as HR


from 
    Batting b
    inner join Master m on m.playerID = b.playerID
    inner join (select yearID, group_concat(distinct HR order by HR desc) as teamRank from Batting group by yearID) x on x.yearID = b.yearID

) x

where 
    rank <= 10 and rank > 0 

order by    
    year desc, rank

Or the top 5 HR totals per team for the year 2010...

select *
from (

select 
    b.yearID as year,
    b.teamID as team,
    m.nameFirst as first, 
    m.nameLast as last, 
    b.HR as HR,
    find_in_set(b.HR, x.teamRank) as rank

from 
    Batting b
    inner join Master m on m.playerID = b.playerID
    inner join (select teamID, group_concat(distinct HR order by HR desc) as teamRank from Batting where yearID = 2010 group by teamID) x on x.teamID = b.teamID
where 
    b.yearID = 2010
) x

where 
    rank <= 5 and rank > 0 

order by    
    team, rank

limit 12

Shows these results...

year    team    first   last        HR  rank
2010    ARI Mark    Reynolds    32  1
2010    ARI Chris   Young       27  2
2010    ARI Kelly   Johnson     26  3
2010    ARI Adam    LaRoche     25  4
2010    ARI Justin  Upton       17  5
2010    ATL Brian   McCann      21  1
2010    ATL Jason   Heyward     18  2
2010    ATL Troy    Glaus       16  3
2010    ATL Martin  Prado       15  4
2010    ATL Eric    Hinske      11  5
2010    BAL Luke    Scott       27  1
2010    BAL Ty      Wigginton   22  2
Tommy O'Dell
  • 7,019
  • 13
  • 56
  • 69