
I have 3 vectors:

x <- c(3, 5, 2)
y <- c(3, 2, 1,  1, 2, 3, 4, 5,  4, 5)
z <- c(2, 4, 8, 1, 5)

x is the number of elements in each group.

y gives the indices used to extract elements from z. The first three indices belong to group 1 (corresponding to the first element of x, 3); the next five indices belong to group 2 (the second element of x, 5); and so on.

z contains the values to be extracted using the indices in y and then summed by group.

For each element in x, I want to sum the elements of z indexed by the corresponding indices in y. For example:

i = 1; indices = 3, 2, 1; sum = 8 + 4 + 2

i = 2; indices = 1, 2, 3, 4, 5; sum = 2 + 4 + 8 + 1 + 5

i = 3; indices = 4, 5; sum = 1 + 5

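So, for this example, the result I'm after is one sum per group:

c(8 + 4 + 2, 2 + 4 + 8 + 1 + 5, 1 + 5)
#> [1] 14 20  6
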
I've thought of creating a new vector with the same length as y whose elements are taken from z at the respective indices, but this loop lies within a bigger program and it may get too complicated. I'd really appreciate your advice on how to specify conditions on the indices to get these sums.

Thanks!

Henrik
Sarah

4 Answers


First, index z with y to get the vector of elements you want to sum. Then create a group index from x and use tapply() to sum within each group:

x <- c(3, 5, 2)
y <- c(3, 2, 1, 1, 2, 3, 4, 5, 4, 5)
z <- c(2, 4, 8, 1, 5)

g <- rep(seq_along(x), x)
tapply(z[y], g, sum)
#>  1  2  3 
#> 14 20  6
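
For reference, the group index g expands to one group label per element of y (three 1s, five 2s, and two 3s):

g
#> [1] 1 1 1 2 2 2 2 2 3 3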
Mikko Marttila

A vectorized solution with base R:

diff(c(0, cumsum(z[y])[cumsum(x)]))
#> [1] 14 20  6

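One way to see why this works is to look at the intermediate steps for the example data:

cumsum(z[y])                         # running total of the selected values
#> [1]  8 12 14 16 20 28 29 34 35 40
cumsum(x)                            # position of the last index of each group
#> [1]  3  8 10
cumsum(z[y])[cumsum(x)]              # running total at each group boundary
#> [1] 14 34 40

Taking differences between successive boundary totals (with a leading 0) then yields the per-group sums.
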
Just for fun, let's benchmark the proposed solutions on a larger dataset (all integer values, so the results of the four functions are identical):

f1 <- function(x, y, z) diff(c(0L, cumsum(z[y])[cumsum(x)]))
f2 <- function(x, y, z) as.integer(tapply(z[y], rep(seq_along(x), x), sum))
f3 <- function(x, y, z) sapply(unname(split(y, rep(1:length(x), x))), function(x) sum(z[x]))
# @Mael's for loop answer modified for speed
f4 <- function(x, y, z) {
  s <- integer(length(x))
  cx <- cumsum(x)
  s[1] <- sum(z[y[seq(x[1])]])
  for(i in 2:length(x)) s[i] <- sum(z[y[seq(cx[i - 1L] + 1L, cx[i])]])
  s
}

x <- sample(10, 1e4, TRUE)
y <- unlist(mapply(function(x) sample(10, x), x))
z <- sample(10)

microbenchmark::microbenchmark(f1(x, y, z),
                               f2(x, y, z),
                               f3(x, y, z),
                               f4(x, y, z),
                               check = "equal")
#> Unit: microseconds
#>         expr     min       lq      mean   median       uq     max neval
#>  f1(x, y, z)   221.7   237.35   269.056   246.75   259.25  1621.3   100
#>  f2(x, y, z)  8659.5  8966.05  9436.873  9097.60  9551.10 13567.7   100
#>  f3(x, y, z)  9960.1 10746.35 11759.939 11030.05 12242.85 43611.5   100
#>  f4(x, y, z) 47574.7 50506.10 51927.481 51728.70 53416.30 58262.5   100
jblood94
  • Nice way to take advantage of the inherent ordering in the question! Your answer reminded me of this series of blog posts with more about these sorts of group statistics optimizations: https://www.brodieg.com/2019/06/10/base-vs-data-table/#group-sums – Mikko Marttila Jul 16 '22 at 10:56

With a for loop:

s <- c()
for (i in seq_along(x)) {
  if (i == 1) {
    # positions in y for the first group
    idx <- seq(x[i])
  } else {
    # positions in y for group i, based on the cumulative group sizes
    idx <- seq(from = cumsum(x)[i - 1] + 1, to = cumsum(x)[i])
  }
  # sum the values of z indexed by this group's portion of y
  s <- c(s, sum(z[y[idx]]))
}

Output:

> s
# [1] 14 20  6
Maël

Another possible solution, in base R:

sapply(unname(split(y, rep(1:length(x), x))), \(x) sum(z[x]))

#> [1] 14 20  6

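Here, split() groups the indices in y according to the group labels built with rep(), and each group of indices is then used to subset and sum z:

split(y, rep(1:length(x), x))
#> $`1`
#> [1] 3 2 1
#> 
#> $`2`
#> [1] 1 2 3 4 5
#> 
#> $`3`
#> [1] 4 5
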
PaulS