I have event file data from retrosheet.org. This is data on baseball games formatted such that each observation is the description of each play in each game of a baseball season (complete with reference variables for game, player, and play).
> str(e.2015.1990)
'data.frame': 4813807 obs. of 42 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 1 1 1 1 1 1 1 1 1 1 ...
$ INNING : num 1 1 1 1 1 1 1 1 1 2 ...
$ BATTING.TEAM : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 1 ...
$ OUTS : int 0 1 2 2 2 2 0 1 2 0 ...
$ BATTER : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
$ BATTER.HAND : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
$ RES.BATTER : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
$ RES.BATTER.HAND : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
$ PITCHER : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
$ PITCHER.HAND : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
$ RES.PITCHER : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
$ RES.PITCHER.HAND : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
$ FIRST.RUNNER : Factor w/ 4369 levels "","abrej003",..: 1 1 1 1 104 140 1 1 1 1 ...
$ SECOND.RUNNER : Factor w/ 4048 levels "","abrej003",..: 1 1 1 26 1 90 1 1 1 1 ...
$ THIRD.RUNNER : Factor w/ 3729 levels "","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EVENT.TEXT : chr "63/G" "6/P" "D8/L+" "S9/G.2-H" ...
$ EVENT.TYPE : Factor w/ 21 levels "2","3","4","5",..: 1 1 19 18 18 1 1 1 1 1 ...
$ AB.FLAG : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ HIT.VALUE : int 1 1 3 2 2 1 1 1 1 1 ...
$ SH.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SF.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ DOUBLE.PLAY.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ TRIPLE.PLAY.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ RBI.ON.PLAY : num 0 0 0 1 0 0 0 0 0 0 ...
$ BATTED.BALL.TYPE : Factor w/ 5 levels "","F","G","L",..: 3 5 4 3 4 5 3 3 5 4 ...
$ BATTER.DEST : int 0 0 2 1 1 0 0 0 0 0 ...
$ RUNNER.ON.1ST.DEST : int 0 0 0 0 2 1 0 0 0 0 ...
$ RUNNER.ON.2ND.DEST : int 0 0 0 4 0 2 0 0 0 0 ...
$ RUNNER.ON.3RD.DEST : int 0 0 0 0 0 0 0 0 0 0 ...
$ SB.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SB.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SB.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.1ST: Factor w/ 3433 levels "","albua001",..: 1 1 1 1 161 161 1 1 1 1 ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.2ND: Factor w/ 3408 levels "","abadf001",..: 1 1 1 133 1 133 1 1 1 1 ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.3RD: Factor w/ 3337 levels "","abadf001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EVENT.NUM : Factor w/ 177 levels "1","10","100",..: 1 90 101 112 123 134 145 156 167 2 ...
From this, I would like to calculate game totals for each player for each game. I would like to format a data frame such that each observations is a description of one player's performance in one game of the season, and each PLAYER in each game composes the entirety of the observations.
I created a new database with two columns, GAME.ID and PLAYER.ID, such that each STARTER in each game composes the entirety of the observations.
> str(k.2015.1990)
'data.frame': 1146866 obs. of 2 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 1 2 3 4 5 6 7 8 9 10 ...
$ PLAYER.ID: Factor w/ 4699 levels "altuj001","bettm001",..: 11 11 11 12 14 12 12 24 24 24 ...
What I need to do next is create additional vectors (for each stat I want to calculate) such that each observation of said vector creates a unique subset of my event data, defined by:
e.2015.1990$GAME.ID = k.2015.1990$GAME.ID
e.2015.1990$PLAYER.ID = k.2015.1990$PLAYER.ID
and then calculates that stat from that subset.
Aggregate() seems to work when calculating HITS from HIT.VALUE (where, for HIT.VALUE, 1 = no hit, 2 = single, 3 = double, 4 = triple, and 5= home run)
p.hit = aggregate(x = list(HIT = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x > 1))
> str(p.hit)
'data.frame': 1287476 obs. of 3 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
$ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HIT : int 0 3 0 1 0 0 1 2 3 1 ...
However, when I adopt the formula to tally, specifically, singles:
p.single = aggregate(x = list(SINGLE = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x = 2))
I get excluseivly "2"s.
> str(p.single)
'data.frame': 1287476 obs. of 3 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
$ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ SINGLE : num 2 2 2 2 2 2 2 2 2 2 ...
Such is also the case for doubles and 3s, triples and 4s, and home-runs and 5s.
I would think there should be a way to calculate a vector such that each observation references the GAME.ID and PLAYER.ID entry on its row, searches the event file database to isolate those observations where GAME.ID = GAME.ID and PLAYER.ID = BATTER, counts the number of observations in that subset where HIT.VALUE = 2 (or = 3 for doubes, = 4 for triples, or = 5 for home runs), and then returns that count to the observation. In excel, this might be done with a CountIf() function that I could easily copy the length of the vector. I don't know how to do it in R, though.