
I have event file data from retrosheet.org. Each observation describes one play in one game of a baseball season, complete with reference variables for game, player, and play.

> str(e.2015.1990)
'data.frame':   4813807 obs. of  42 variables:
 $ GAME.ID                              : Factor w/ 60464 levels "ANA201504100",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ INNING                               : num  1 1 1 1 1 1 1 1 1 2 ...
 $ BATTING.TEAM                         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 1 ...
 $ OUTS                                 : int  0 1 2 2 2 2 0 1 2 0 ...
 $ BATTER                               : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
 $ BATTER.HAND                          : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
 $ RES.BATTER                           : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
 $ RES.BATTER.HAND                      : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
 $ PITCHER                              : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
 $ PITCHER.HAND                         : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
 $ RES.PITCHER                          : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
 $ RES.PITCHER.HAND                     : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
 $ FIRST.RUNNER                         : Factor w/ 4369 levels "","abrej003",..: 1 1 1 1 104 140 1 1 1 1 ...
 $ SECOND.RUNNER                        : Factor w/ 4048 levels "","abrej003",..: 1 1 1 26 1 90 1 1 1 1 ...
 $ THIRD.RUNNER                         : Factor w/ 3729 levels "","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EVENT.TEXT                           : chr  "63/G" "6/P" "D8/L+" "S9/G.2-H" ...
 $ EVENT.TYPE                           : Factor w/ 21 levels "2","3","4","5",..: 1 1 19 18 18 1 1 1 1 1 ...
 $ AB.FLAG                              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ HIT.VALUE                            : int  1 1 3 2 2 1 1 1 1 1 ...
 $ SH.FLAG                              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SF.FLAG                              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ DOUBLE.PLAY.FLAG                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ TRIPLE.PLAY.FLAG                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ RBI.ON.PLAY                          : num  0 0 0 1 0 0 0 0 0 0 ...
 $ BATTED.BALL.TYPE                     : Factor w/ 5 levels "","F","G","L",..: 3 5 4 3 4 5 3 3 5 4 ...
 $ BATTER.DEST                          : int  0 0 2 1 1 0 0 0 0 0 ...
 $ RUNNER.ON.1ST.DEST                   : int  0 0 0 0 2 1 0 0 0 0 ...
 $ RUNNER.ON.2ND.DEST                   : int  0 0 0 4 0 2 0 0 0 0 ...
 $ RUNNER.ON.3RD.DEST                   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SB.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SB.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SB.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.1ST: Factor w/ 3433 levels "","albua001",..: 1 1 1 1 161 161 1 1 1 1 ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.2ND: Factor w/ 3408 levels "","abadf001",..: 1 1 1 133 1 133 1 1 1 1 ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.3RD: Factor w/ 3337 levels "","abadf001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EVENT.NUM                            : Factor w/ 177 levels "1","10","100",..: 1 90 101 112 123 134 145 156 167 2 ...

From this, I would like to calculate totals for each player in each game. I would like a data frame in which each observation describes one player's performance in one game of the season, with one row for every PLAYER in every game.

I created a new data frame with two columns, GAME.ID and PLAYER.ID, with one row for every STARTER in every game.

> str(k.2015.1990)
'data.frame':   1146866 obs. of  2 variables:
 $ GAME.ID  : Factor w/ 60464 levels "ANA201504100",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ PLAYER.ID: Factor w/ 4699 levels "altuj001","bettm001",..: 11 11 11 12 14 12 12 24 24 24 ...

What I need to do next is create additional vectors (one for each stat I want to calculate) such that each element of the vector picks out a unique subset of my event data, defined by:

e.2015.1990$GAME.ID == k.2015.1990$GAME.ID
e.2015.1990$BATTER == k.2015.1990$PLAYER.ID

and then calculates that stat from that subset.

aggregate() seems to work when calculating HITS from HIT.VALUE (where, for HIT.VALUE, 1 = no hit, 2 = single, 3 = double, 4 = triple, and 5 = home run):

p.hit = aggregate(x = list(HIT = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x > 1))

> str(p.hit)
'data.frame':   1287476 obs. of  3 variables:
 $ GAME.ID  : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
 $ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HIT      : int  0 3 0 1 0 0 1 2 3 1 ...

However, when I adapt the call to tally singles specifically:

p.single = aggregate(x = list(SINGLE = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x = 2))

I get exclusively 2s.

> str(p.single)
'data.frame':   1287476 obs. of  3 variables:
 $ GAME.ID  : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
 $ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SINGLE   : num  2 2 2 2 2 2 2 2 2 2 ...

The same happens for the other hit types: the doubles version returns all 3s, the triples version all 4s, and the home-run version all 5s.
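
A small check on a toy vector (the values are made up for illustration, and the names hv, wrong, and right are hypothetical) suggests a likely cause: inside sum(x = 2), the single = appears to pass the number 2 to sum() as an argument named x rather than compare anything, so the data never enters the calculation. A comparison would need ==.

```r
# Toy HIT.VALUE vector (hypothetical values): 1 = no hit, 2 = single, 3 = double
hv <- c(1, 2, 2, 2, 3)

# `x = 2` hands the number 2 to sum() as an argument named x; hv is never used
wrong <- (function(x) sum(x = 2))(hv)

# `x == 2` compares each element to 2, and sum() counts the TRUE values
right <- (function(x) sum(x == 2))(hv)

print(wrong)
print(right)
```

This would also be consistent with the doubles, triples, and home-run versions returning 3, 4, and 5 on every row.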

I would think there should be a way to calculate a vector such that each element takes the GAME.ID and PLAYER.ID on its row, searches the event data for the observations where GAME.ID = GAME.ID and PLAYER.ID = BATTER, counts the observations in that subset where HIT.VALUE = 2 (or = 3 for doubles, = 4 for triples, or = 5 for home runs), and returns that count. In Excel, this might be done with a CountIf() formula that I could easily copy down the length of the column. I don't know how to do it in R, though.
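
The CountIf-style lookup described above can be sketched in base R with aggregate() followed by merge(). This is only a minimal sketch on toy data: the data frames e and k stand in for e.2015.1990 and k.2015.1990, all values are invented, and the names counts, res, and stat.cols are hypothetical. It has not been tried on the full 4.8 million rows.

```r
# Toy stand-ins for e.2015.1990 and k.2015.1990 (all values hypothetical)
e <- data.frame(
  GAME.ID   = c("G1", "G1", "G1", "G1", "G2", "G2"),
  BATTER    = c("a", "a", "b", "a", "a", "b"),
  HIT.VALUE = c(2, 3, 1, 2, 5, 2),
  stringsAsFactors = FALSE
)
k <- data.frame(
  GAME.ID   = c("G1", "G1", "G2", "G2", "G2"),
  PLAYER.ID = c("a", "b", "a", "b", "c"),
  stringsAsFactors = FALSE
)

# Count each hit type per game and batter; summing a logical vector counts its TRUEs
counts <- aggregate(
  x  = list(SINGLE = e$HIT.VALUE == 2,
            DOUBLE = e$HIT.VALUE == 3,
            TRIPLE = e$HIT.VALUE == 4,
            HR     = e$HIT.VALUE == 5),
  by = list(GAME.ID = e$GAME.ID, PLAYER.ID = e$BATTER),
  FUN = sum
)

# Attach the counts to the starter table; all.x = TRUE keeps starters with no events
res <- merge(k, counts, by = c("GAME.ID", "PLAYER.ID"), all.x = TRUE)

# Starters with no matching events come back as NA; treat those as zero
stat.cols <- c("SINGLE", "DOUBLE", "TRIPLE", "HR")
res[stat.cols][is.na(res[stat.cols])] <- 0

print(res)
```

Doing the comparison before aggregating means one aggregate() call can produce all four columns, and the merge keeps exactly the rows of the starter table rather than every batter who appears in the event data.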
