I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable, x, whose values I would like to sum for each duplicate id. The observations are time dependent: there are year and month variables, and I'd like to keep the chronologically first observation of each duplicate id and add the x values of the subsequent dupes to this first observation.
I've included dummy data that resembles what I have (dat1), as well as a data set that shows the structure of my desired outcome (outcome).
I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns; I need to retain these for the first observation of each duplicate id. The second strategy doesn't sum the values of x correctly.
Any suggestions for how to get my desired outcome would be much appreciated.
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
                   year = rep(c("2006", "2007"), each = 5),
                   month = rep(c("December", "January"), each = 5),
                   x = round(rnorm(10, 10, 3), 2))
# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
                      year = c(rep("2006", 4), rep("2007", 3)),
                      month = c(rep("December", 4), rep("January", 3)),
                      x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))
# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))
# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]
# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
              .(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id,
                  data = merge(dat1_unique, dat1_dupes, all.x = TRUE),
                  FUN = sum)
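
For reference, here is a minimal base R sketch of the operation described above: sort chronologically, put the per-id total of x on every row, then keep only the first row of each id. It is one possible approach, not necessarily the best one. The toy data frame and its values are hypothetical, made up so the block runs on its own, and the sort assumes that month holds full English month names.

```r
# Hypothetical toy data (not dat1): ids 1 and 2 each appear twice
toy <- data.frame(id    = c(1, 2, 1, 3, 2),
                  year  = c("2007", "2006", "2006", "2007", "2007"),
                  month = c("January", "December", "December", "January", "January"),
                  x     = c(3, 2, 1.5, 4, 5))

# Sort chronologically: by year, then by position of the month in month.name
# (assumes full English month names; as.character guards against factor columns)
toy <- toy[order(as.integer(as.character(toy$year)),
                 match(as.character(toy$month), month.name)), ]

# Replace x on every row with the total of x for that row's id
toy$x <- ave(toy$x, toy$id, FUN = sum)

# Keep only the chronologically first row of each id; year and month are retained
result <- toy[!duplicated(toy$id), ]
rownames(result) <- NULL
result
```

The same three steps should carry over to dat1 by substituting it for toy. If the rows are already in chronological order, as they are in the dummy data, the sorting step can be skipped.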