Sum duplicates then remove all but first occurrence

Question

I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable x, whose values I would like to sum for each duplicate id. The observations are time dependent, there are year and month variables, and I'd like to keep the chronologically first observation of each duplicate id and add the subsequent dupes to this first observation.

I've included dummy data that resembles what I have: dat1. I've also included a data set that shows the structure of my desired outcome: outcome.

I've tried two strategies, neither of which quite give me what I want (see below). The first strategy gives me the correct values for x, but I loose my year and month columns - I need to retain these for all the first duplicate id values. The second strategy doesn't sum the values of x correctly.

Any suggestions for how to get my desired outcome would be much appreciated.

# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
                   year = rep(c("2006", "2007"), each = 5),
                   month = rep(c("December", "January"), each = 5),
                   x = round(rnorm(10, 10, 3), 2))

# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
                      year = c(rep("2006", 4), rep("2007", 3)),
                      month = c(rep("December", 4), rep("January", 3)),
                      x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))

# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))

# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]

# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
              .(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes,
                  all.x = TRUE), FUN = sum)

IRTFM · Accepted Answer · 2014-07-31T15:08:45.207

5

I got different sums, but it were b/c I forgot the seed:

> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
    id year    month     x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December  7.16
6 8503 2007  January 13.23
7 2961 2007  January  7.38
9 8564 2007  January  7.21

(To be safer It would be better to work on a copy. And you might need to add an ordering step.)

edited Jul 31 '14 at 15:08

answered Feb 08 '13 at 02:09

IRTFM

258,963
21
364
487

wow, that's great! I didn't fudge the math - I think you maybe didn't copy the seed number I put at the top of the code. I get the values I gave in my `outcome` dataset using your code when I use that seed. Thanks so much for such a neat two liner! – Chris Feb 08 '13 at 02:18
Ooops. sorry. I probably did miss the random seed being set since it was such a small data.frame. – IRTFM Feb 08 '13 at 06:20

score 3 · Answer 2 · edited Feb 08 '13 at 18:39

3

You could do this with data.table (quicker, more memory efficiently than plyr)

With a bit of self-joining fun using mult ='first'. Keying by id year and month will sort by id, year then month.

library(data.table)
DT <- data.table(dat1, key = c('id','year','month'))


# setnames is required as there are two x columns that get renamed x, x.1
DT1 <- setnames(DT[DT[,list(x=sum(x)),by=id],mult='first'][,x:=NULL],'x.1','x')

Or a simpler approach :

DT = as.data.table(dat1)

DT[,x:=sum(x),by=id][!duplicated(id)]

     id year    month     x
1: 1234 2006 December 36.42
2: 1321 2006 December 11.55
3: 4321 2006 December 17.31
4: 7423 2006 December  5.97
5: 8503 2007  January 12.48
6: 2961 2007  January 10.22
7: 8564 2007  January 11.41

edited Feb 08 '13 at 18:39

Matt Dowle

58,872
22
166
224

answered Feb 08 '13 at 01:17

mnel

113,303
27
265
254

1

thanks so much for your solutions! The above `data.table` solution works perfectly, but the syntax looks horrendous (for me - being unfamiliar with `data.table`). I noticed you had a `plyr` solution in your original answer before editing. Your `plyr` code gave the following error: `Error: length(rows) == 1 is not TRUE`. If you have time, would you mind providing a `plyr` example that works? I feel the syntax may be more understandable to me and the `data.table` performance gains are insignificant for 5k rows of data (I don't need to scale this in future). – Chris Feb 08 '13 at 02:05
@Chris, My `plyr` solution was not correct (which is why I had deleted it). If you can ensure that the data are sorted (correctly) by year then month. then `ddply(dat1, .(id), summarise, x = sum(x),month = month[1],year = year[1])` would work. – mnel Feb 08 '13 at 02:56

Sum duplicates then remove all but first occurrence

2 Answers2