Loop or apply for sum of rows based on multiple conditions in R dataframe

Question

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which, from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, and I fear I'm making a relatively simple problem much too convoluted.

I have a dataset as follows:

id  count   group
2   6   A
2   8   A
2   6   A
8   5   A
8   6   A
8   3   A
10  6   B
10  6   B
10  6   B
11  5   B
11  6   B
11  7   B
16  6   C
16  2   C
16  0   C
18  6   C
18  1   C
18  6   C

I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.

In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.

This is what I've come up with:

id <- c(rep(c(2,8,10,11,16,18),each=3))
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- c(rep(c("A","B","C"),each=6))
df <- data.frame(id,count,group)

newid<-c()
newcount<-c()
newgroup<-c()
for (i in 1:length(unique(df$"id"))) {
  newid[i] <- unique(df$"id")[i]
  newcount[i]<-sum(df[df$"id"==unique(df$"id")[i],2][1:2])
  newgroup[i] <- as.character(df$"group"[df$"id"==newid[i]][1])
}

newdf<-data.frame(newid,newcount,newgroup)

Some possible improvements/alternatives I'm not sure about:

For loops vs apply functions
Can I create a dataframe directly inside a for loop, or should I stick to creating vectors that I can later assign to a dataframe?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)

Hao · Answer 1 · 2015-04-07T13:59:44.693

You can pass a user-defined function to aggregate:

# Sum the first two elements of x (gives NA if x has fewer than two elements)
sum1sttwo <- function(x) {
  x[1] + x[2]
}
aggregate(count ~ id + group, data = df, FUN = sum1sttwo)

and the output is:

  id group count
1  2     A    14
2  8     A    11
3 10     B    12
4 11     B    11
5 16     C     8
6 18     C     7

04/2015 edit: dplyr and data.table are better choices when your data set is large, since grouped operations on a base R data.frame can be slow. However, if you just need to aggregate a small, simple data set, the aggregate function in base R serves the purpose.
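For the more general "first x days" case mentioned in the question, a minimal base R sketch might look like the following. The helper name sum_first_n is hypothetical (it is not part of base R), and the data frame is rebuilt from the question so the block runs on its own. Using head() instead of x[1] + x[2] means an ID with fewer than n rows is summed over whatever rows it has rather than returning NA.

```r
# Rebuild the example data from the question
df <- data.frame(
  id    = rep(c(2, 8, 10, 11, 16, 18), each = 3),
  count = c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6),
  group = rep(c("A", "B", "C"), each = 6)
)

# Hypothetical helper: sum the first n values of x.
# head() returns fewer than n values when x is short, so no NA is introduced.
sum_first_n <- function(x, n = 2) sum(head(x, n))

# Apply the helper to count within each id/group combination
res <- aggregate(count ~ id + group, data = df,
                 FUN = function(x) sum_first_n(x, n = 2))
res
```

Changing n = 2 to another value gives the total over the first n measurements per ID without touching the rest of the call.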

score 1 · Answer 2 · answered Mar 03 '15 at 23:17

You could use dplyr:

library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))

The pipe syntax makes it easy to read: group your data by id and group, take the first two rows of each group, then sum the counts.
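As a sketch of how the same pipeline could be wrapped for an arbitrary number of leading rows, the function below takes n as a parameter. The name first_n_sum is hypothetical, and the .groups argument assumes dplyr 1.0.0 or later. Here head() is applied inside summarise() in place of the separate slice() step, which should give the same totals for this data.

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  id    = rep(c(2, 8, 10, 11, 16, 18), each = 3),
  count = c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6),
  group = rep(c("A", "B", "C"), each = 6)
)

# Hypothetical wrapper: sum the first n counts per id and group
first_n_sum <- function(data, n = 2) {
  data %>%
    group_by(id, group) %>%
    # .groups = "drop" returns an ungrouped result (assumes dplyr >= 1.0.0)
    summarise(newcount = sum(head(count, n)), .groups = "drop")
}

res <- first_n_sum(df, n = 2)
res
```

Returning an ungrouped result avoids surprises if res is passed on to further dplyr verbs.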

NicE


score 1 · Accepted Answer · edited Mar 04 '15 at 06:44

You could do this using data.table:

library(data.table)
# setDT() converts df to a data.table by reference (no copy is made)
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
#    id group newcount
#1:  2     A       14
#2:  8     A       11
#3: 10     B       12
#4: 11     B       11
#5: 16     C        8
#6: 18     C        7
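Two details of the call above may be worth a sketch. First, count[1:2] yields NA for any id with a single row, because indexing past the end of a vector returns NA. Second, setDT() changes df itself. The variant below is only an illustration under those observations: it uses as.data.table() to work on a copy and head() to tolerate short groups. The variable names dt, n and res are chosen here for the example.

```r
library(data.table)

# Rebuild the example data from the question
df <- data.frame(
  id    = rep(c(2, 8, 10, 11, 16, 18), each = 3),
  count = c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6),
  group = rep(c("A", "B", "C"), each = 6)
)

# Why head() is used below: indexing past the end of a vector gives NA
x <- 5                          # a group with a single measurement
short_index <- sum(x[1:2])      # NA, because x[2] is NA
short_head  <- sum(head(x, 2))  # 5

# as.data.table() makes a copy, so df stays a plain data.frame
dt <- as.data.table(df)
n <- 2                          # number of leading rows to sum per id
res <- dt[, .(newcount = sum(head(count, n))), by = .(id, group)]
res
```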

edited Mar 04 '15 at 06:44

Arun


answered Mar 04 '15 at 04:16

akrun


score 0 · Answer 4 · answered Mar 03 '15 at 22:37

    library(plyr)

    # Keep the first 2 counts for each id and group (returned as columns V1 and V2)
    df2 <- ddply(df, c("id", "group"), function(x) x$count[1:2])

    # Sum the two counts for each id and group
    df3 <- ddply(df2, c("id", "group"), summarize, count = V1 + V2)

    df3
    #   id group count
    # 1  2     A    14
    # 2  8     A    11
    # 3 10     B    12
    # 4 11     B    11
    # 5 16     C     8
    # 6 18     C     7
