I have a large dataframe: think 100 million rows and 20 columns. Some of the rows are near duplicates: they have the same key (or keys), but I need to combine the other 19 columns (e.g., via sums or averages). The logic of what I need to do is something like:

N <- 1000000  ## times 100 in real life
mydataframe <- data.frame(key = as.integer(runif(N) * N * 10),
                          val1 = runif(N), val2 = runif(N))

## collapse each key group to a single row whose value columns
## hold the within-group sums
o <- do.call("rbind", by(mydataframe, mydataframe[, 1], function(dd) {
   if (nrow(dd) == 1) return(dd)   ## singleton key: keep the row as is
   for (coli in 2:ncol(dd)) dd[1, coli] <- sum(dd[, coli])
   dd[1, ]                         ## return only the collapsed first row
}))

print(dim(o))
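
For what it is worth, a one-call base-R way to state the same intent is `aggregate()`; I am not claiming it is any faster on data this size, it just pins down the result I am after:

## sum every non-key column within each key group; same result as above
o2 <- aggregate(. ~ key, data = mydataframe, FUN = sum)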

This by()/rbind() version is dreadfully slow. I can speed up the by() (e.g., with split() and multicore), but I think it is the rbind() that is so slow. (I am now wondering whether I should first break out all the duplicated rows and combine just those, but duplicated() only flags the second and later occurrences. Is there a function that marks all instances of duplicate keys, including the first one?)
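
To be concrete about that last point, something along these lines (duplicated() scanned in both directions) would flag every row whose key appears more than once, including the first occurrence; I have not timed it at this scale:

## TRUE for every row whose key occurs more than once
has.dup.key <- duplicated(mydataframe$key) | duplicated(mydataframe$key, fromLast = TRUE)
table(has.dup.key)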

Is there a more efficient, more idiomatic R way to write this task?

Advice appreciated.

/iaw

ivo Welch

  • see `dplyr` and `data.table` solutions here: https://stackoverflow.com/questions/8212699/group-by-multiple-columns-and-sum-other-multiple-columns – mt1022 Apr 13 '22 at 02:37
  • 1
    You'll benefit from using `data.table` with such a large data set - try `setDT(mydataframe)[, lapply(.SD, sum), by = key]`. – Ritchie Sacramento Apr 13 '22 at 02:37
  • it looks like base R is really unsuitable for this task, but data.table (and probably dplyr) work orders of magnitude faster (see the expanded sketch below). thanks everyone. – ivo Welch Apr 13 '22 at 17:43
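
Addendum: the `setDT` one-liner suggested in the comments, expanded into a minimal runnable sketch. I am assuming plain sums for every non-key column, and I have only checked it on the 1M-row toy data above, not the full-size data:

library(data.table)

dt <- as.data.table(mydataframe)       ## or setDT(mydataframe) to convert in place
o  <- dt[, lapply(.SD, sum), by = key] ## sum every non-key column within each key
print(dim(o))

The question linked in the first comment has the equivalent `dplyr` group_by/summarise version.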

0 Answers