9

I need the weighted sum of each column of a matrix.

data <- matrix(1:2e7,1e7,2) # warning large number, will eat up >100 megs of memory
weights <- 1:1e7/1e5
system.time(colSums(data*weights))
system.time(apply(data,2,function(x) sum(x*weights)))
all.equal(colSums(data*weights), apply(data,2,function(x) sum(x*weights)))

Typically colSums(data*weights) is faster than the apply call.

I do this operation often (on a large matrix). Hence looking for advice on the most efficient implementation. Ideally, would have been great if we could pass weights to colSums (or rowSums).

Thanks, appreciate any insights!

mnel
  • 113,303
  • 27
  • 265
  • 254
Anirban
  • 271
  • 2
  • 8

2 Answers2

8

colSums and * are both internal or primitive functions and will be much faster than the apply approach

Another approach you could try is to use some basic matrix algebra as you are looking for

 weights %*% data

The matrix multiplication method does not appear to be faster but it will avoid creating a temporary object the size of data

system.time({.y <- colSums(data * weights)})
##  user  system elapsed 
##  0.12    0.03    0.16 


system.time({.x <- weights %*% data})
##   user  system elapsed 
##   0.20    0.05    0.25 
mnel
  • 113,303
  • 27
  • 265
  • 254
3

Rcpp leads to a performance gain (particularly with a larger number of columns).

library(Rcpp)
library(inline)
src <- '
 Rcpp::NumericMatrix dataR(data);
 Rcpp::NumericVector weightsR(weights);
 int ncol = dataR.ncol();
 Rcpp::NumericVector sumR(ncol);
 for (int col = 0; col<ncol; col++){
   sumR[col] = Rcpp::sum(dataR( _, col)*weightsR);
 }
 return Rcpp::wrap(sumR);'

weighted.colSums <- cxxfunction(
  signature(data="numeric", weights="numeric"), src, plugin="Rcpp")
data <- matrix(as.numeric(1:1e7),1e5,100) # warning large object
weights <- 1:1e5/1e5
all.equal(colSums(data*weights), weighted.colSums(data, weights))
## [1] TRUE
print(system.time(colSums(data*weights)))
##   user  system elapsed 
##  0.065   0.001   0.064 
print(system.time(as.vector(weighted.colSums(data, weights))))
##   user  system elapsed 
##  0.019   0.001   0.019 
all.equal(as.vector(weights %*% data), weighted.colSums(data, weights))
## [1] TRUE
print(system.time(weights %*% data))
##   user  system elapsed 
##  0.066   0.001   0.066 
Anirban
  • 271
  • 2
  • 8