Summarize observations in r

Question

I have been working on the following csv file from http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv just for my own practice.

I am however not sure of how to do the following:

Summarize the observations from the same team (teamID) in the same year by adding up the component values. That is, you should end up with only one record per team per year, and this record should have the year, team name, total runs, total hits, total X2B ,…. Total HBP.

Here is the code I have so far but it is only giving me only one team per year yet I need all the teams for each year with their totals (e.g, for 1980, I need all the teams with totalruns,totalhits,.....,for 1981, all the teams with totalruns,totalhits,.... and so on)

newdat1 <- read.csv("http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv")

id <- split(1:nrow(newdata1), newdata1$yearID)

a2 <- data.frame(yearID=sapply(id, function(i)  newdata1$yearID[i[1]]),
    teamID=sapply(id,function(i) newdata$teamID[i[1]]),
        totalRuns=sapply(id, function(i) sum(newdata1$R[i],na.rm=TRUE)), 
        totalHits=sapply(id, function(i) sum(newdata1$H[i],na.rm=TRUE)),
    totalX2B=sapply(id, function(i) sum(newdata1$X2B[i],na.rm=TRUE)),
    totalX3B=sapply(id, function(i) sum(newdata1$X3B[i],na.rm=TRUE)),
    totalHR=sapply(id, function(i) sum(newdata1$HR[i],na.rm=TRUE)),
    totalBB=sapply(id, function(i) sum(newdata1$BB[i],na.rm=TRUE)), 
    totalSB=sapply(id, function(i) sum(newdata1$SB[i],na.rm=TRUE)),
    totalGIDP=sapply(id, function(i) sum(newdata1$GIDP[i],na.rm=TRUE)),
    totalIBB=sapply(id, function(i) sum(newdata1$IBB[i],na.rm=TRUE)),
    totalHBP=sapply(id, function(i) sum(newdata1$HBP[i],na.rm=TRUE)))
a2

I would search for questions using `aggregate` .. .a starter `tot = aggregate(HBP ~ teamID + yearID, newdata1, sum)`. You can have more than one column on the lhs of the formula — user20650, Oct 22 '15 at 14:16

Josh W. · Answer 1 · 2015-10-22T14:36:45.527

Perhaps try something like:

library("dplyr")
newdata1 %>%
    group_by(yearID, teamID) %>%
    summarize_each(funs(sum(., na.rm = T)), R, H, X2B, 
                      X3B, HR, BB, SB, GIDP, IBB, HBP)

Naturally this is most useful if you're comfortable with dplyr library. This is a guess without looking at the data too closely.

Also, instead of listing each column that you wish to sum over, you can use alternatively do

 summarize_each(funs(sum(., na.rm = T)), -column_to_exclude1, -column_to_exlude2)

And so forth.

score 0 · Answer 2 · answered Oct 22 '15 at 14:27

I'd suggest looking at ddply in the plyr package. See here for a good explanation of what I think you're trying to do.

For this example try the following code:

# ddply function in the plyr package

library(plyr)

# summarize the dataframe newdat1, using yearID and teamID as grouping variables

outputdat <-ddply(newdat1, c("yearID", "teamID"), summarize, 
            totalRuns= sum(R),  # add all summary variables you need here...
            totalHits= sum(H),  # other summary functions (mean, sd etc) also work
            totalX2B = sum(X2B))

Hope that helps?

I have tried the code above and it works fine. – Ken Stats Oct 22 '15 at 18:40 — Ken Stats, Oct 22 '15 at 18:40

jogo · Answer 3 · 2015-10-23T11:51:26.680

0

library(plyr)
ddply(newdat1, ~ teamID + yearID, summarize, sum(R), sum(X2B), sum(SO), sum(IBB), sum(HBP))

eventually sum(..., na.rm=TRUE)

also data.table{} can do that:

library(data.table)
DT <- as.data.table(newdat1[,-c(1,5)])
setkey(DT, teamID, yearID)
DT[, lapply(.SD, sum, na.rm=TRUE), .(teamID, yearID)]

edited Oct 23 '15 at 11:51

answered Oct 22 '15 at 14:27

jogo

12,469
11
37
42

Summarize observations in r

3 Answers3