0

I'm struggling with determining the MySQL code to create a column (average park_factor) in my table "starting_pitcher_stats" that I'd like to contain the season-to-date in-season average values of the values in another column (park_factor). I'd like this in-season average to be grouped by pitcher and by date.

Ideally, the table would look like this:

    pitcher     park_fac   avg_park_fac       date
    aased001    94         94           1977-07-31
    aased001    100        97           1977-08-06
    aased001    108        100.666      1977-08-11
    aased001    108        102.5        1977-08-16
    aased001    96         101.2        1977-08-21
    aased001    108        102.33       1977-08-26
    aased001    108        103.14       1977-08-31
    aased001    104        103.25       1977-09-05
    aased001    108        103.77       1977-09-10
    aased001    92         102.6        1977-09-16
    aased001    106        102.9        1977-09-22
    aased001    108        103.33       1977-09-27

The code I'm using is:

SELECT Starting_Pitcher, full_park_factor, AVG(full_park_factor), Game_Date
FROM starting_pitcher_stats 
GROUP BY Starting_Pitcher, Game_Date, Game_Number

...and a sample of the resulting table looks like this:

pitcher     park_fac   avg_park_fac  date
aased001    94         94.0000      1977-07-31
aased001    100        100.0000     1977-08-06
aased001    108        108.0000     1977-08-11
aased001    108        108.0000     1977-08-16
aased001    96         96.0000      1977-08-21
aased001    108        108.0000     1977-08-26
aased001    108        108.0000     1977-08-31
aased001    104        104.0000     1977-09-05
aased001    108        108.0000     1977-09-10
aased001    92         92.0000      1977-09-16
aased001    106        106.0000     1977-09-22
aased001    108        108.0000     1977-09-27

Can someone help please?

Thank you in advance for help with this. Lee

LeeZee
  • 107
  • 1
  • 12
  • 1
    I don't know what you want us to help you with. Consider providing proper CREATE and INSERT statements corresponding to the desired result – Strawberry May 14 '16 at 06:32
  • Hi strawberry, I'm just looking to create a column avg_Park_factor that has the average value of the corresponding row value + values of rows from the previous dates within a given year of the park_factors column. – LeeZee May 14 '16 at 14:02

2 Answers2

2

You need to join your table on all previous results for the same pitcher from the same table.

I'm not quite sure how you define your season, but assuming it's by calendar year, the following query produces the desired output.

SELECT
    a.Starting_Pitcher, a.full_park_factor,
    AVG(b.full_park_factor), a.Game_Date, a.Game_Number
FROM starting_pitcher_stats a
INNER JOIN starting_pitcher_stats b
    ON a.Starting_Pitcher = b.Starting_Pitcher
    AND (b.Game_Date < a.Game_Date OR
         (b.Game_Date = a.Game_Date AND b.Game_Number <= a.Game_Number))
    AND YEAR(b.Game_Date) = YEAR(a.Game_Date)
GROUP BY a.Starting_Pitcher, a.Game_Date, a.Game_Number;

You seem to want to update a column in your table with the result of this calculation. This can be achieved with real-time updates with a trigger which updates the column whenever you insert or update existing data or using a view.

CREATE VIEW starting_pitcher_stats_with_average AS
SELECT
    a.Starting_Pitcher, a.full_park_factor,
    AVG(b.full_park_factor), a.Game_Date, a.Game_Number
FROM starting_pitcher_stats a
INNER JOIN starting_pitcher_stats b
    ON a.Starting_Pitcher = b.Starting_Pitcher
    AND (b.Game_Date < a.Game_Date OR
         (b.Game_Date = a.Game_Date AND b.Game_Number <= a.Game_Number))
    AND YEAR(b.Game_Date) = YEAR(a.Game_Date)
GROUP BY a.Starting_Pitcher, a.Game_Date, a.Game_Number;

In your own answer you create a procedure to update the average column for all records in the table in one go, so perhaps you don't want to have the column update as you insert data, but simply be able to add the average for all rows on demand. In this case, you can write an UPDATE statement which includes the SELECT query above as a subquery. Because MySQL cannot use the same table for the UPDATE and the subquery, you have to wrap the subquery in another SELECT so that MySQL generates a temporary table from your results.

UPDATE starting_pitcher_stats c
SET c.std_F_parkfactor = (
    SELECT d.std_F_parkfactor FROM (
        SELECT
            a.Starting_Pitcher,
            AVG(b.full_park_factor) std_F_parkfactor,
            a.Game_Date, a.Game_Number
        FROM starting_pitcher_stats a
        INNER JOIN starting_pitcher_stats b
            ON a.Starting_Pitcher = b.Starting_Pitcher
            AND (b.Game_Date < a.Game_Date OR
                 (b.Game_Date = a.Game_Date
                  AND b.Game_Number <= a.Game_Number))
            AND YEAR(b.Game_Date) = YEAR(a.Game_Date)
        GROUP BY a.Starting_Pitcher, a.Game_Date, a.Game_Number
    ) d
    WHERE c.Starting_Pitcher = d.Starting_Pitcher
    AND c.Game_Date = d.Game_Date
    AND c.Game_Number = d.Game_Number
);
Matt Raines
  • 4,149
  • 8
  • 31
  • 34
  • Yep, I think that's it - the penultimate line may or may not be necessary – Strawberry May 14 '16 at 14:10
  • Matt, thank you! This works great. Now trying to add AVG(b.full_park_factor) field as a column to the table with "ALTER TABLE starting_pitcher_stats ADD COLUMN AVG(b.full_park_factor), but it won't work... The Game_Number column just refers to the game number of a double-header if it occurred on a given date(i.e., 0=game 1; 1=game 1 of doubleheader if it occurred; 2=game 2 of a double-header). But I can just add that column to the code. Thank you. – LeeZee May 14 '16 at 16:49
  • MySQL [doesn't support the use of expressions in default values](http://stackoverflow.com/questions/270309/can-i-use-a-function-for-a-default-value-in-mysql). You could write a trigger to update a column in the table on insert, but it will only work if you always enter the stats in date order. The minute you miss one and enter it later, you'd have to update all the following records in the table. You might be better off looking into creating [a view from your `SELECT` statement](http://dev.mysql.com/doc/refman/5.7/en/create-view.html) depending on your requirements. – Matt Raines May 14 '16 at 17:13
  • I've edited the answer to reflect the fact that more than one game can take place on the same day, referenced by `Game_Number`. – Matt Raines May 14 '16 at 17:16
  • Thanks Matt. I will look into those other options, especially creating a view. But then I wonder about the feasibility to later analyze the data (including the data in that avg_Park_factor column In a view) in that table in a stats program like R. Will the data from that column still be "available" and manipulatable for a data analysis program even if only stored in a view? I wonder is it possible to create a storable data column for average_park_factor if I can recreate what the expression does without using that AVG expression, like avg_park_factor=@park_factor+park_factor/@row number? – LeeZee May 14 '16 at 19:12
  • You're outside my area of expertise now; I've never used R. It surprises me that you can't calculate the in-season average in it rather than MySQL though. Maybe open a follow-up question? – Matt Raines May 14 '16 at 19:39
  • Ok, I found a way to do this (i.e., store the season-to-date park factor average in a column in a table) as a work-around to using the AVG function (which isn't a method which is supported by MySQL). Please see above edit to the OP with the code that worked. – LeeZee May 16 '16 at 05:55
  • I think you should post that as a new answer rather than an edit. But, I'm confused about what you're trying to achieve. I've edited my answer with some new suggestions but your procedure looks unnecessarily complicated if it does what I think it does. – Matt Raines May 16 '16 at 08:55
  • Hi Matt, what you added is very useful. Thank you. Right now, I'm trying to re-organize MLB data already collected from 1974 through the end of the 2014 season. However, I'm realizing that your additional codes will come in handy when I start doing real-time in-season daily updates based on the day's games just played--When I do, it sounds a trigger + your view code will be necessary to automatically update the table as soon as new data is imported into it. As for the procedure, I'll have to figure out how to eliminate unnecessary code from it. – LeeZee May 17 '16 at 05:40
  • Matt, when I do start to insert new data on a daily basis, is it pretty straightforward to write a trigger that will update this column upon running an UPDATE or VIEW statement? – LeeZee May 17 '16 at 18:05
  • It would be straightforward, but quite inefficient, to write a trigger that did this update of the whole table after every `INSERT` or `UPDATE` to the table. If you wanted to only update rows that were affected by the change, that would be a little more complicated. There should be enough material in [the manual](http://dev.mysql.com/doc/refman/5.7/en/trigger-syntax.html) to get you started. If you use a view, you don't need to update it. The query is run whenever the view is used. – Matt Raines May 17 '16 at 18:35
0

UPDATE: Here's a way to store the season-to-date (in-season) park-factor average of a given season in a column, which in this case is the average of another column's values in the same table using a stored procedure. It essentially calculates the average by dividing by the row_number which corresponds to the number of rows of values that have been looped through in calculating this variable. This approach works if you have already-collected data that you want to update in one go or seldomly, but as implied by Matt Raines, may have to be run more frequently than his suggested approach(es). If the table will get updated frequently with data from successive days' games' results at least once daily, I think it will be less labor-intensive to use his approach(es). Please let me know what can be eliminated:

DROP PROCEDURE IF EXISTS std_park_factor_avg;
DELIMITER $$
CREATE PROCEDURE std_park_factor_avg()
BEGIN
    DECLARE pit_id CHAR(10);
    DECLARE lgID CHAR (2); 
    DECLARE YEARID INT;
    DECLARE gdate DATE;
    DECLARE seq INT;
    DECLARE F_park_factor INT;
    DECLARE RNUMBER INT;
    DECLARE accum_F_parkfactor REAL;
    DECLARE accum_row_number INT;
    DECLARE accum_avg_F_parkfactor REAL;
    DECLARE prev_year YEAR(4);
    DECLARE end_of_cursor BOOLEAN;

    DECLARE no_table CONDITION FOR SQLSTATE '42S02';

    DECLARE c1 CURSOR FOR
      SELECT Starting_Pitcher, lg_ID, YEAR_ID, Game_Date, Game_Number, full_park_factor, ROW_NUMBER 
        FROM starting_pitcher_stats
        GROUP BY Starting_Pitcher, lg_ID, YEAR_ID, Game_Date, Game_Number;

    DECLARE CONTINUE HANDLER FOR NOT FOUND
      SET end_of_cursor := TRUE;

    SET end_of_cursor := FALSE;  -- reset
    SET prev_year := 0;          -- reset control-break

    OPEN c1;

    fetch_loop: LOOP
      FETCH c1 INTO pit_id, lgID, YEARID, gdate,seq, F_park_factor, RNUMBER;
      IF end_of_cursor THEN
        LEAVE fetch_loop;
      END IF;

      -- check control-break conditions
      IF YEAR(gdate) != prev_year THEN
       SET accum_F_parkfactor := 0.0;
       SET RNUMBER:= 1.0;
       SET accum_avg_F_parkfactor := 0.0;
        SET prev_year := YEAR(gdate);
      END IF;

     SET accum_F_parkfactor := accum_F_parkfactor + F_park_factor;
     SET accum_avg_F_parkfactor := accum_F_parkfactor/RNUMBER;

      UPDATE starting_pitcher_stats
        SET std_F_parkfactor =accum_avg_F_parkfactor
        WHERE Starting_Pitcher = pit_id
          AND lg_ID = lgID 
          AND YEAR_ID = YEARID
          AND Game_Date = gdate
          AND Game_Number = seq;

    END LOOP;
    CLOSE c1;
  END
$$

DELIMITER ;
LeeZee
  • 107
  • 1
  • 12