
I'm trying to find the idiomatic way in R to partition a numerical vector by some index vector, find the sum of all numbers in that partition and then divide each individual entry by that partition sum. In other words, if I start with this:

df <- data.frame(x = c(1,2,3,4,5,6), index = c('a', 'a', 'b', 'b', 'c', 'c'))

I want the output to be a vector (let's call it z):

c(1/(1+2), 2/(1+2), 3/(3+4), 4/(3+4), 5/(5+6), 6/(5+6))

If I were doing this in SQL and could use window functions, I would do this:

select 
 x / sum(x) over (partition by index) as z 
from df

and if I were using plyr, I would do something like this:

ddply(df, .(index), transform, z = x / sum(x))

but I'd like to know how to do it using the standard R functional programming tools like mapply/aggregate etc.

John Horton

3 Answers


Yet another option is ave(). For good measure, I've collected the answers above, tried my best to make their output equivalent (a vector), and provided timings over 1000 runs using your example data as input. First, my answer using ave():

ave(df$x, df$index, FUN = function(z) z / sum(z))

I also show an example using the data.table package since it is usually pretty quick, but I know you're looking for base solutions, so you can ignore that if you want.
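
If you want the result attached to the data frame as the z column from the question, it can be assigned directly, since ave() returns a vector in the original row order; a small sketch on the example data:

df$z <- ave(df$x, df$index, FUN = function(z) z / sum(z))
df
#   x index         z
# 1 1     a 0.3333333
# 2 2     a 0.6666667
# 3 3     b 0.4285714
# 4 4     b 0.5714286
# 5 5     c 0.4545455
# 6 6     c 0.5454545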

And now a bunch of timings:

library(data.table)
library(plyr)
dt <- data.table(df)

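# each approach wrapped as a zero-argument function so benchmark() below can time it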
plyr <- function() ddply(df, .(index), transform, z = x / sum(x))
av <- function() ave(df$x, df$index, FUN = function(z) z/sum(z))
t.apply <- function() unlist(tapply(df$x, df$index, function(x) x/sum(x)))
l.apply <- function() unlist(lapply(split(df$x, df$index), function(x){x/sum(x)}))
b.y <- function() unlist(by(df$x, df$index, function(x){x/sum(x)}))
agg <- function() aggregate(df$x, list(df$index), function(x){x/sum(x)})
d.t <- function() dt[, x/sum(x), by = index]

library(rbenchmark)
benchmark(plyr(), av(), t.apply(), l.apply(), b.y(), agg(), d.t(), 
           replications = 1000, 
           columns = c("test", "elapsed", "relative"),
           order = "elapsed")
#-----

       test elapsed  relative
4 l.apply()   0.052  1.000000
2      av()   0.168  3.230769
3 t.apply()   0.257  4.942308
5     b.y()   0.694 13.346154
6     agg()   1.020 19.615385
7     d.t()   2.380 45.769231
1    plyr()   5.119 98.442308

The lapply() solution seems to win in this case, and data.table is surprisingly slow. Let's see how this scales to a bigger aggregation problem:

df <- data.frame(x = sample(1:100, 1e5, TRUE), index = gl(1000, 100))
dt <- data.table(df)

#Replication code omitted for brevity, used 100 replications and dropped plyr() since I know it 
#will be slow by comparison:
       test elapsed  relative
6     d.t()   2.052  1.000000
1      av()   2.401  1.170078
3 l.apply()   4.660  2.270955
2 t.apply()   9.500  4.629630
4     b.y()  16.329  7.957602
5     agg()  20.541 10.010234

That seems more consistent with what I'd expect.

In summary, you've got plenty of good options. Find one or two methods that work with your mental model of how aggregation tasks should work and master that function. Many ways to skin a cat.
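
As a side note (not timed above): if what you actually want is the proportion as a new column rather than a free-standing vector, data.table's := assignment works by group; a minimal sketch using the dt defined earlier:

dt[, z := x / sum(x), by = index]
# adds z to dt by reference, in the original row order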

Edit - and an example with 1e7 rows

Probably not large enough for Matt, but as big as my laptop can handle without crashing:

df <- data.frame(x = sample(1:100, 1e7, TRUE), index = gl(10000, 1000))
dt <- data.table(df)
#-----
       test elapsed  relative
6     d.t()    0.61  1.000000
1      av()    1.45  2.377049
3 l.apply()    4.61  7.557377
2 t.apply()    8.80 14.426230
4     b.y()    8.92 14.622951
5     agg()   18.20 29.836066
Chase
  •     Glad you realised the first test was finding significant differences of insignificant times. I dunno why `benchmark` has a `replications` argument really - it just seems to encourage people to time overhead and miss the point entirely about `data.table`. – Matt Dowle Jun 06 '12 at 10:25
  • Also, `1e5` isn't big enough for `data.table` to really shine. Try `1e6`,`1e7` and `1e8`. It should then be significantly faster than the next fastest (`ave()`). A `numeric` vector of length `1e8` is 0.75GB, so that's starting to be the size we mean by large data. At some point `ave()` will fail with 'out of memory', too, but `data.table` will continue to work. – Matt Dowle Jun 12 '12 at 11:08
  • A 64bit netbook with 4GB of RAM is under £300, so `1e8` is not large really. 1e8 = 100 million = 1/20th of the (current) maximum vector length in R (2 billion). On machines with larger RAM available (say 32GB-128GB) try `1e9` (7GB). Seems like `1e10`+ will be possible in R 2.16, so _someone_ must have enough RAM (or something RAM-like). – Matt Dowle Jun 12 '12 at 11:25
  • @MatthewDowle - good points. Maybe I should amend my answer to further indicate I was balancing a *larger* aggregation problem with my patience to see the results and update my answer. As is often the case, my attention span is the limiting factor :). I'm traveling today, but will update my answer later to further show `data.table()` in its best light. – Chase Jun 12 '12 at 11:28
  • @MatthewDowle - my puny laptop kept crashing at 1e8 rows, but added 1e7...as expected, data.table is starting to pull away. – Chase Jun 22 '12 at 23:23
  • @Chase Ok we can work with 1e7 rows (76MB per column). If you drop it into 4th gear and try `gl(100000, 100)`, `gl(1000000, 10)` and `gl(2000000, 5)`, you should find that data.table then takes off. – Matt Dowle Jun 23 '12 at 16:45
  • @Chase And using `rep(1:2000000,each=5)` might be faster to work with for testing than `gl()` which can be slow to create the factor. – Matt Dowle Jun 23 '12 at 16:49
  • We could make this the canonical question to which `ave` is the answer in base. (I just marked something as a dupe of it.) Folks might be interested in seeing `dplyr`'s performance as well: `d.plyr <- function() df %>% group_by(index) %>% mutate(z=x/sum(x))`. After 10 reps, I see `d.plyr()` beating `av()` with the 10e7 data. – Frank May 20 '15 at 17:53

Three other approaches as well:

dat <- 1:6
lev <- rep(1:3, each = 2)

lapply(split(dat, lev), function(x){x/sum(x)})
by(dat, lev, function(x){x/sum(x)})
aggregate(dat, list(lev), function(x){x/sum(x)})
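
Note that split() orders the results by group, so when the index vector isn't already sorted, unsplit() will map the group-wise ratios back to the original row order; a small sketch with the same data interleaved:

dat <- c(5, 1, 6, 2, 3, 4)
lev <- c(3, 1, 3, 1, 2, 2)
unsplit(lapply(split(dat, lev), function(x){x/sum(x)}), lev)
# [1] 0.4545455 0.3333333 0.5454545 0.6666667 0.4285714 0.5714286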
Tyler Rinker

If you're only operating on a single vector and only need a single indexing vector, then tapply is quite fast:

dat <- 1:6
lev <- rep(1:3, each = 2)
tapply(dat, lev, function(x){x/sum(x)})
#$`1`
#[1] 0.3333333 0.6666667
#
#$`2`
#[1] 0.4285714 0.5714286
#
#$`3`
#[1] 0.4545455 0.5454545
#
unlist(tapply(dat, lev, function(x){x/sum(x)}))
#       11        12        21        22        31        32 
#0.3333333 0.6666667 0.4285714 0.5714286 0.4545455 0.5454545 
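
If the concatenated group/position names on the unlisted result get in the way, unlist() can drop them; a small sketch:

unlist(tapply(dat, lev, function(x){x/sum(x)}), use.names = FALSE)
# [1] 0.3333333 0.6666667 0.4285714 0.5714286 0.4545455 0.5454545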
Dason