I have a large data frame: think 100 million rows and 20 columns. Some of the rows are near duplicates: they share the same key (or keys), and I need to combine the other 19 columns (e.g., via sums or averages). The logic of what I need to do is something like this:
N <- 1000000  ## times 100 in real life
mydataframe <- data.frame(key  = as.integer(runif(N) * N * 10),
                          val1 = runif(N),
                          val2 = runif(N))
o <- do.call("rbind", by(mydataframe, mydataframe[, 1], function(dd) {
    if (nrow(dd) == 1) return(dd)        ## only one row for this key: nothing to combine
    for (coli in 2:ncol(dd)) dd[1, coli] <- sum(dd[, coli])   ## fold the value columns into row 1
    dd[1, ]                              ## keep just the combined row
}))
print(dim(o))
This is dreadfully slow. I can speed up the by() part (e.g., with split() and multicore), but I think it is the rbind() that is so slow. I am now wondering whether I should first break out only the rows with duplicated keys and combine just those, but duplicated() only flags the second and later occurrences. Is there a function to mark all instances of duplicate keys, including the first one?
Is there a more efficient, R-paradigm way to write this task?
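For the sum case specifically, here is a sketch of one direction I am considering, using base R's rowsum(), which sums the value columns grouped by a key (untested at 100 million rows, and it only covers sums, not averages):

## grouped sum in one call: one output row per distinct key
o2 <- rowsum(mydataframe[, c("val1", "val2")], group = mydataframe$key)
o2 <- data.frame(key = as.integer(rownames(o2)), o2)  ## recover the key from the row names
print(dim(o2))

Averages would presumably need a per-key count as well (e.g., from table() on the key) followed by a division.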
Advice appreciated.
/iaw