I have a large dataframe: think 100 million rows and 20 columns. Some of the rows are near duplicates: they have the same key (or keys), but I need to combine the other 19 columns (e.g., via sums or averages). The logic of what I need to do is something like:

N <- 1000000  ## times 100 in real life
mydataframe <- data.frame(key = as.integer(runif(N) * N * 10),
                          val1 = runif(N), val2 = runif(N))

## collapse each key group to a single row whose value columns
## hold the within-group sums
o <- do.call("rbind", by(mydataframe, mydataframe[, 1], function(dd) {
   if (nrow(dd) == 1) return(dd)   ## singleton key: keep the row as is
   for (coli in 2:ncol(dd)) dd[1, coli] <- sum(dd[, coli])
   dd[1, ]                         ## return only the collapsed first row
}))

print(dim(o))
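
For what it is worth, a one-call base-R way to state the same intent is `aggregate()`; I am not claiming it is any faster on data this size, it just pins down the result I am after:

## sum every non-key column within each key group; same result as above
o2 <- aggregate(. ~ key, data = mydataframe, FUN = sum)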

This by()/rbind() version is dreadfully slow. I can speed up the by() (e.g., with split() and multicore), but I think it is the rbind() that is so slow. (I am now wondering whether I should first break out all the duplicated rows and combine just those, but duplicated() only flags the second and later occurrences. Is there a function that marks all instances of duplicate keys, including the first one?)
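
To be concrete about that last point, something along these lines (duplicated() scanned in both directions) would flag every row whose key appears more than once, including the first occurrence; I have not timed it at this scale:

## TRUE for every row whose key occurs more than once
has.dup.key <- duplicated(mydataframe$key) | duplicated(mydataframe$key, fromLast = TRUE)
table(has.dup.key)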

Is there a more efficient, more idiomatic R way to write this task?

Advice appreciated.

/iaw

ivo Welch

  • see `dplyr` and `data.table` solutions here: https://stackoverflow.com/questions/8212699/group-by-multiple-columns-and-sum-other-multiple-columns – mt1022 Apr 13 '22 at 02:37
  • 1
    You'll benefit from using `data.table` with such a large data set - try `setDT(mydataframe)[, lapply(.SD, sum), by = key]`. – Ritchie Sacramento Apr 13 '22 at 02:37
  • it looks like base R is really unsuitable for this task, but data.table (and probably dplyr) work orders of magnitude faster (see the expanded sketch below). thanks everyone. – ivo Welch Apr 13 '22 at 17:43
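
Addendum: the `setDT` one-liner suggested in the comments, expanded into a minimal runnable sketch. I am assuming plain sums for every non-key column, and I have only checked it on the 1M-row toy data above, not the full-size data:

library(data.table)

dt <- as.data.table(mydataframe)       ## or setDT(mydataframe) to convert in place
o  <- dt[, lapply(.SD, sum), by = key] ## sum every non-key column within each key
print(dim(o))

The question linked in the first comment has the equivalent `dplyr` group_by/summarise version.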

0 Answers