15

I have a dataset whose headers look like so:

PID Time Site Rep Count

I want to sum Count by Rep for each PID x Time x Site combo.

Then, on the resulting data.frame, I want to get the mean value of Count for each PID x Time x Site combo.

Current function is as follows:

dummy <- function(data) {
  # Step 1: sum Count (dropping NAs) within each PID x Time x Site x Rep combo
  A <- aggregate(Count ~ PID + Time + Site + Rep, data = data,
                 function(x) sum(na.omit(x)))
  # Step 2: mean of those per-Rep sums within each PID x Time x Site combo
  B <- aggregate(Count ~ PID + Time + Site, data = A, mean)
  return(B)
}

This is painfully slow (the original data.frame is 510,000 rows by 20 columns). Is there a way to speed this up with plyr?
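For reference, the plyr equivalent of the two aggregate() calls would be roughly the following (an untested sketch):

library(plyr)
# Step 1: sum Count (ignoring NAs) within each PID x Time x Site x Rep combo
A <- ddply(data, .(PID, Time, Site, Rep), summarise,
           Count = sum(Count, na.rm = TRUE))
# Step 2: mean of those per-Rep sums within each PID x Time x Site combo
B <- ddply(A, .(PID, Time, Site), summarise, Count = mean(Count))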

Maiasaura
  • You don't state why you have the A<- line. Are you making sure there's a weighting there? To meet your stated criterion you only need the B<- line, as sketched after these comments. (plyr won't make it any faster but data.table will) – John Oct 11 '11 at 14:26
  • Right, data.table was super fast (and am so glad I learned about it). A was redundant, yeah. – Maiasaura Oct 11 '11 at 19:42
  • Then the A line would explain why your aggregate was so slow. If you just run B it should go pretty quick. Note the data.table command provided also removes Rep from the grouping. What was slowing things down so much is that A was basically a new data.frame the same size as the original. – John Oct 26 '11 at 14:56
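A minimal sketch of John's suggestion, assuming the A step really is redundant:

# Single aggregate: mean of Count within each PID x Time x Site group
B <- aggregate(Count ~ PID + Time + Site, data = data, mean)

Note this is the mean over all raw Count values, not the mean of the per-Rep sums, so it only matches the original function when those coincide.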

2 Answers

26

You should look at the data.table package for faster aggregation operations on large data frames. For your problem, the solution would look like this:

library(data.table)
data_t <- data.table(data_tab)
# Sum and mean of Count within each PID x Time x Site group
ans <- data_t[, list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
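Note that this takes the mean of the raw Count values rather than the mean of the per-Rep sums, as the comments on the question point out. If the original two-step logic is wanted, a chained data.table version would look roughly like this (a sketch; ans2 is just an illustrative name):

# Step 1: sum Count within each PID x Time x Site x Rep group,
# Step 2: average those sums within each PID x Time x Site group
ans2 <- data_t[, list(total = sum(Count, na.rm = TRUE)),
               by = 'PID,Time,Site,Rep'][
         , list(meanCount = mean(total)), by = 'PID,Time,Site']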
Ramnath
  • Wow, data.table does work really fast! How can I get the result columns A and B sorted by `PID`, `Time` and `Site` (just the way aggregate does it)? – exAres May 08 '15 at 12:19
  • @Sangram use `keyby` instead of `by` (see the sketch after these comments) – David Arenburg Sep 21 '15 at 21:06
  • @DavidArenburg What is the fastest way to do grouped aggregation on very large data tables? Is the above it? My understanding is that when you use keys, subsetting uses binary search, which is faster, though it makes the code a bit less readable – wolfsatthedoor Apr 01 '17 at 18:12
  • @robertevansanders, it is very likely that it is faster to do aggregations on keyed data, but I don't think I ever tested it. – David Arenburg Apr 01 '17 at 21:03
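A minimal sketch of the keyby suggestion above: keyby groups just like by, but also sets the key on the result, which sorts it by the grouping columns the way aggregate's output is ordered:

# keyby groups, then keys (and therefore sorts) the result by PID, Time, Site
ans <- data_t[, list(A = sum(Count), B = mean(Count)), keyby = 'PID,Time,Site']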
9

Let's see how fast data.table is compared to dplyr. This would be roughly the way to do it in dplyr:

data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))

Or perhaps this, depending on exactly how the question is interpreted:

data %>% group_by(PID, Time, Site) %>%
    summarise(totalCount = sum(Count), meanCount = mean(Count))

Here is a full example of these alternatives versus @Ramnath's proposed answer and the one @DavidArenburg proposed in the comments, which I think is equivalent to the second dplyr statement.

# Simulate data roughly the size the OP reported (510,000 rows)
nrow <- 510000
data <- data.frame(PID = sample(letters, nrow, replace = TRUE), 
                   Time = sample(letters, nrow, replace = TRUE),
                   Site = sample(letters, nrow, replace = TRUE),
                   Rep = rnorm(nrow),
                   Count = rpois(nrow, 100))


library(dplyr)
library(data.table)

Rprof(tf1 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))
Rprof()
summaryRprof(tf1)  #reports 1.68 sec sampling time

Rprof(tf2 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(total = sum(Count), meanCount = mean(Count)) 
Rprof()
summaryRprof(tf2)  # reports 1.60 seconds

Rprof(tf3 <- tempfile())
data_t <- data.table(data)
ans <- data_t[, list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf3)  #reports 0.06 seconds

Rprof(tf4 <- tempfile())
ans <- setDT(data)[,.(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf4)  #reports 0.02 seconds

The data.table method is much faster, and the setDT variant is faster still!
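The likely reason for the gap: setDT converts the existing data.frame to a data.table in place, by reference, whereas data.table(data) copies it first. A small illustration (note that it modifies data itself):

# setDT converts by reference: no copy is made, and 'data' itself
# is a data.table after the call
setDT(data)
class(data)  # "data.table" "data.frame"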

vpipkt
  • It would be even faster if you would use `setDT(data)[,.(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']` instead of creating a copy – David Arenburg Sep 21 '15 at 21:01
  • Have you checked the [50-100GB benchmarks](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)? – Arun Sep 21 '15 at 21:10
  • Not yet; it seemed more interesting in context to use a data set about the size the OP reported. – vpipkt Sep 21 '15 at 21:15