2

I'm trying to compute a specific sum in R as quickly as possible. The object of interest is

enter image description here

and the relevant input objects are two L times K matrices x (contains only positive integers) and alpha (contains only positive real values). A is equivalent to rowSums(alpha) and N is equivalent to rowSums(x). Subscripts l and k denote a row / a column of alpha or x, respectively.

At first I thought it's going to be easy to come up with something that's super-quick, but I couldn't find an elegant solution. I think a matrix-valued version of seq() would be very helpful here. Does anyone have a creative solution to implement this efficiently?

Here's an easy-to-read, but obviously inefficient, loop-based version for reference:

# parameters
L = 20
K = 5

# x ... L x K matrix of integers
x = matrix(1 : (L * K), L, K)

# alpha ... L x K matrix of positive real numbers
alpha = matrix(1 : (L * K) / 100, L, K)

# N ... sum over rows of x
N = rowSums(x)

# A ... sum over rows of alpha
A = rowSums(alpha)


# implementation 

stacksum = function(x, alpha, N, A){
  
  # parameters
  K = ncol(x)
  L = nrow(x)
  
  result = 0
  
  for(ll in 1:L){
  
  # first part of sum
  first.sum = 0
  
  for(kk in 1:K){
    
    # create sequence
    sequence.k = seq(alpha[ll, kk], (alpha[ll, kk] + x[ll, kk] - 1), 1)
    
    # take logs and sum
    first.sum = first.sum + sum(log(sequence.k))
    
  }
  
  # second part of sum
  second.sum = sum(log(seq(A[ll], (A[ll] + N[ll] - 1), 1)))
  
  # add to result
  result = result + first.sum - second.sum
  
  }
  
  return(result)
  
  
}

# test
stacksum(x, alpha, N, A)
yrx1702
  • 1,619
  • 15
  • 27
  • I'm not really sure what you mean by a *"a matrix-valued version of `seq()`"* - could you say more about how that would work and why it would be helpful? – Gregor Thomas Aug 03 '22 at 18:38
  • 2
    A sum of log(k) over consecutive integers k is equal to log(n!) - log(m!), where m and n are the lower and upper bounds, respectively. Can you use that to simplify the summations? Looks like R has the function `lgamma` to compute log(gamma(x)). If nothing else, you can use that to replace the innermost summations. – Robert Dodier Aug 03 '22 at 18:46
  • 1
    @GregorThomas: One of the 'bottlenecks' is the creation of the sequences from the elements of matrix `alpha` to the elements of matrix `alpha + x - 1`, which, in my example implementation is done in a loop via `seq(alpha[ll, kk], (alpha[ll, kk] + x[ll, kk] - 1), 1)`. A first thought was that something that automatically sequences element-wise (e.g., `seq(alpha, alpha + x - 1, 1)`) could give a faster solution as most of the necessary sequences then can be computed in a single step. – yrx1702 Aug 03 '22 at 18:49
  • @RobertDodier: Thanks, that's a very interesting insight, even though the sequence is unfortunately not an integer-sequence usually. I'll nonetheless look into this! – yrx1702 Aug 03 '22 at 18:49
  • As long as j and m increment by 1, the summation is log(gamma(something)), isn't it? You have something like log (a . (a + 1) . (a + 2) . ... . (a + n)) for `a` being a positive number. Maybe I've overlooked something. – Robert Dodier Aug 03 '22 at 19:02

1 Answers1

4

Update with a lgamma solution based on @RobertDodier comments.

Using sequence and rep.int.

# parameters
L <- 20
K <- 5

# x ... L x K matrix of integers
x <- matrix(1 : (L * K), L, K)
# alpha ... L x K matrix of positive real numbers
alpha <- matrix(1 : (L * K) / 100, L, K)
# N ... sum over rows of x
N <- rowSums(x)
# A ... sum over rows of alpha
A <- rowSums(alpha)

# proposed solution
stacksum2 <- function(x, alpha, N, A) {
  sum(log(sequence(x, alpha) + rep.int(alpha %% 1, x))) - sum(log(sequence(N, A) + rep.int(A %% 1, N)))
}

# solution from Robert Dodier's comments
stacksum3 <- function(x, alpha, N, A) {
  sum(lgamma(alpha + x) - lgamma(alpha)) - sum(lgamma(A + N) - lgamma(A))
}

# OP solution
stacksum1 = function(x, alpha, N, A){
  # parameters
  K = ncol(x)
  L = nrow(x)
  result = 0
  
  for(ll in 1:L){
    # first part of sum
    first.sum = 0
    for(kk in 1:K){
      # create sequence
      sequence.k = seq(alpha[ll, kk], (alpha[ll, kk] + x[ll, kk] - 1), 1)
      # take logs and sum
      first.sum = first.sum + sum(log(sequence.k))
    }
    # second part of sum
    second.sum = sum(log(seq(A[ll], (A[ll] + N[ll] - 1), 1)))
    # add to result
    result = result + first.sum - second.sum
  }
  result
}
res <- list(
  stacksum1(x, alpha, N, A),
  stacksum2(x, alpha, N, A),
  stacksum3(x, alpha, N, A)
)

all.equal(res[1:2], res[-1])
#> [1] TRUE

microbenchmark::microbenchmark(stacksum1 = stacksum1(x, alpha, N, A),
                               stacksum2 = stacksum2(x, alpha, N, A),
                               stacksum3 = stacksum3(x, alpha, N, A),
                               check = "equal")
#> Unit: microseconds
#>       expr    min      lq     mean  median      uq    max neval
#>  stacksum1 1654.2 1704.60 1899.384 1740.80 1964.75 4234.4   100
#>  stacksum2  238.2  246.45  258.284  252.35  268.40  319.4   100
#>  stacksum3   18.5   19.05   20.981   20.55   21.70   36.4   100
jblood94
  • 10,340
  • 1
  • 10
  • 15
  • Very nice answer! Also bizarre that the `stacksum3` benchmark has one giant outlier dragging the mean all the way to 45 when the median is 20 and the uq is 22! It's like your computer stalled on a single iteration. – Gregor Thomas Aug 04 '22 at 12:59
  • After looking into it, it appears to be the first iteration in a new session (`reprex::reprex`). The outlier goes away if I run each function before benchmarking and (bizarrely) keep the `check = "equal"` argument. I updated the answer. – jblood94 Aug 04 '22 at 15:14