0

I have been working on the following csv file from http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv just for my own practice.

I am however not sure of how to do the following:

  • Summarize the observations from the same team (teamID) in the same year by adding up the component values. That is, you should end up with only one record per team per year, and this record should have the year, team name, total runs, total hits, total X2B ,…. Total HBP.

Here is the code I have so far but it is only giving me only one team per year yet I need all the teams for each year with their totals (e.g, for 1980, I need all the teams with totalruns,totalhits,.....,for 1981, all the teams with totalruns,totalhits,.... and so on)

newdat1 <- read.csv("http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv")

id <- split(1:nrow(newdata1), newdata1$yearID)

a2 <- data.frame(yearID=sapply(id, function(i)  newdata1$yearID[i[1]]),
    teamID=sapply(id,function(i) newdata$teamID[i[1]]),
        totalRuns=sapply(id, function(i) sum(newdata1$R[i],na.rm=TRUE)), 
        totalHits=sapply(id, function(i) sum(newdata1$H[i],na.rm=TRUE)),
    totalX2B=sapply(id, function(i) sum(newdata1$X2B[i],na.rm=TRUE)),
    totalX3B=sapply(id, function(i) sum(newdata1$X3B[i],na.rm=TRUE)),
    totalHR=sapply(id, function(i) sum(newdata1$HR[i],na.rm=TRUE)),
    totalBB=sapply(id, function(i) sum(newdata1$BB[i],na.rm=TRUE)), 
    totalSB=sapply(id, function(i) sum(newdata1$SB[i],na.rm=TRUE)),
    totalGIDP=sapply(id, function(i) sum(newdata1$GIDP[i],na.rm=TRUE)),
    totalIBB=sapply(id, function(i) sum(newdata1$IBB[i],na.rm=TRUE)),
    totalHBP=sapply(id, function(i) sum(newdata1$HBP[i],na.rm=TRUE)))
a2
user20650
  • 24,654
  • 5
  • 56
  • 91
Ken Stats
  • 11
  • 3
  • I would search for questions using `aggregate` .. .a starter `tot = aggregate(HBP ~ teamID + yearID, newdata1, sum)`. You can have more than one column on the lhs of the formula – user20650 Oct 22 '15 at 14:16
  • 1
    That's quite a lot of `sapply`s :) – talat Oct 22 '15 at 14:16

3 Answers3

1

Perhaps try something like:

library("dplyr")
newdata1 %>%
    group_by(yearID, teamID) %>%
    summarize_each(funs(sum(., na.rm = T)), R, H, X2B, 
                      X3B, HR, BB, SB, GIDP, IBB, HBP)

Naturally this is most useful if you're comfortable with dplyr library. This is a guess without looking at the data too closely.

Also, instead of listing each column that you wish to sum over, you can use alternatively do

 summarize_each(funs(sum(., na.rm = T)), -column_to_exclude1, -column_to_exlude2)

And so forth.

Josh W.
  • 1,123
  • 1
  • 10
  • 17
0

I'd suggest looking at ddply in the plyr package. See here for a good explanation of what I think you're trying to do.

For this example try the following code:

# ddply function in the plyr package

library(plyr)

# summarize the dataframe newdat1, using yearID and teamID as grouping variables

outputdat <-ddply(newdat1, c("yearID", "teamID"), summarize, 
            totalRuns= sum(R),  # add all summary variables you need here...
            totalHits= sum(H),  # other summary functions (mean, sd etc) also work
            totalX2B = sum(X2B))

Hope that helps?

Adam Kimberley
  • 879
  • 5
  • 12
0
library(plyr)
ddply(newdat1, ~ teamID + yearID, summarize, sum(R), sum(X2B), sum(SO), sum(IBB), sum(HBP))

eventually sum(..., na.rm=TRUE)

also data.table{} can do that:

library(data.table)
DT <- as.data.table(newdat1[,-c(1,5)])
setkey(DT, teamID, yearID)
DT[, lapply(.SD, sum, na.rm=TRUE), .(teamID, yearID)]
jogo
  • 12,469
  • 11
  • 37
  • 42