7

I have a large matrix that I would like to center:

X <- matrix(sample(1:10, 5e+08, replace=TRUE), ncol=10000)

Finding the the means is quick and efficient with colMeans:

means <- colMeans(X)

But what's a good (fast and memory efficient) way to subtract the respective mean from each column? This works, but it doesn't feel right:

for (i in 1:length(means)){
  X[,i] <- X[,i]-means[i] 
}

Is there a better way?

/edit: Here's a modification the the various benchmarks DWin wrote, on a larger matrix, including the other posted suggestions:

require(rbenchmark)
X <- matrix(sample(1:10, 5e+07, replace=TRUE), ncol=10000)
frlp.c <- compiler:::cmpfun(function(mat){
  means <- colMeans(mat)
  for (i in 1:length(means)){
    mat[,i] <- mat[,i]-means[i] 
  }
  return(mat)
})

mat.c <- compiler:::cmpfun(function(mat){
  t(t(X) - colMeans(X))
})

swp.c <- compiler:::cmpfun(function(mat){
  sweep(mat, 2, colMeans(mat), FUN='-')
})

scl.c <- compiler:::cmpfun(function(mat){
  scale(mat, scale=FALSE)
})

matmult.c <- compiler:::cmpfun(function(mat){
  mat-rep(1, nrow(mat)) %*% t(colMeans(mat))
})

benchmark( 
  frlp.c=frlp.c(X),
  mat=mat.c(X),
  swp=swp.c(X),
  scl=scl.c(X), 
  matmult=matmult.c(X),
  replications=10,
  order=c('replications', 'elapsed'))

The matmult function seems to be new winner! I really want to try these out on a 5e+08 element matrix, but I keep running out of RAM.

     test replications elapsed relative user.self sys.self user.child sys.child
5 matmult           10   11.98    1.000      7.47     4.47         NA        NA
1  frlp.c           10   35.05    2.926     31.66     3.32         NA        NA
2     mat           10   50.56    4.220     44.52     5.67         NA        NA
4     scl           10   58.86    4.913     50.26     8.42         NA        NA
3     swp           10   61.25    5.113     51.98     8.64         NA        NA
Zach
  • 29,791
  • 35
  • 142
  • 201
  • Maybe `scale` funtion could help you. See `?scale`. Another useful function could be `sweep`. – Jilber Urbina Sep 08 '12 at 16:10
  • @Jiber: the scale function is much slower than the for loop above. sweep should work though, thanks! – Zach Sep 08 '12 at 16:36
  • Who is 'wuber'? The `benchmark` function was written by Wacek Kusnierczyk. – IRTFM Sep 09 '12 at 00:49
  • @DWin: Sorry, I was referencing your post, and got the name wrong. I've been reading a lot of stuff written by `wuber` on cross-validated lately. – Zach Sep 09 '12 at 06:17

5 Answers5

6

Could this be useful for you?

sweep(X, 2, colMeans(X)) # this substracts the colMean to each col
scale(X, center=TRUE, scale=FALSE) # the same

sweep(X, 2, colMeans(X), FUN='/') # this makes division

If you want to speed up your code based on the for loop you can use cmpfun from compiler package. Example

X <- matrix(sample(1:10, 500000, replace=TRUE), ncol=100) # some data
means <- colMeans(X) # col means

library(compiler)

# One of your functions to be compiled and tested
Mean <- function(x) {
  for (i in 1:length(means)){
      X[,i] <- X[,i]-means[i] 
  }
  return(X)
}



CMean <- cmpfun(Mean) # compiling the Mean function

system.time(Mean(X))
   user  system elapsed 
  0.028   0.016   0.101 
system.time(CMean(X))
   user  system elapsed 
  0.028   0.012   0.066 

Maybe this suggestion could help you.

Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138
5

This seems to be about twice as fast as sweep().

X - rep(1, nrow(X)) %*% t(colMeans(X))

X <- matrix(sample(1:10, 5e+06, replace=TRUE), ncol=10000)
system.time(sweep(X, 2, colMeans(X)))
   user  system elapsed 
   0.33    0.00    0.33 
system.time(X - rep(1, nrow(X)) %*% t(colMeans(X)))
   user  system elapsed 
   0.15    0.03    0.19 

DWin edit: When I did this with a smaller matrix than the OP used (only 5e+07) I get these timings, where Josh's is mat2 (The larger one overflowed into virtual memory on my Mac w/ 32GB and needed to be terminated) :

  test replications elapsed relative user.self sys.self user.child sys.child
2 mat2            1   0.546 1.000000     0.287    0.262          0         0
3  mat            1   2.372 4.344322     1.569    0.812          0         0
1 frlp            1   2.520 4.615385     1.720    0.809          0         0
4  swp            1   2.990 5.476190     1.959    1.043          0         0
5  scl            1   3.019 5.529304     1.984    1.046          0         0
IRTFM
  • 258,963
  • 21
  • 364
  • 487
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
3

I can see why Jilber was uncertain regarding what you wanted, since at one point you ask for division but in your code you use subtraction. The sweep operation he suggests is superfluous here. Just using scale would do it:

 cX <- scale(X, scale=FALSE) # does the centering with subtraction of col-means
 sX <- scale(X, center=FALSE) # does the scaling operation
 csX <- scale(X) # does both

(It's hard to believe that scale is slower. Look at it's code. Uses sweep on columns)

 scale.default # since it's visible.

A matrix approach:

t( t(X) / colMeans(X) )

Edit: Some timings (I was wrong about scale being equivalent to sweep-colMeans):

require(rbenchmark)
benchmark(
    mat={sX <- t( t(X) / colMeans(X) ) },
    swp ={swX <- sweep(X, 2, colMeans(X), FUN='/')},
    scl={sX <- scale(X, center=FALSE)}, 
    replications=10^2,
    order=c('replications', 'elapsed'))
#-----------
  test replications elapsed relative user.self sys.self user.child sys.child
1  mat          100   0.015 1.000000     0.015        0          0         0
2  swp          100   0.015 1.000000     0.015        0          0         0
3  scl          100   0.025 1.666667     0.025        0          0         0

Some funny things happen when you scal this up. The above timigns were mad with samall matrix-X. Below is with something closer to what you were using:

     benchmark( 
        frlp ={means <- colMeans(X)
                       for (i in 1:length(means)){
                              X[,i] <- X[,i]-means[i] 
                                }
                      },
         mat={sX <- t( t(X) - colMeans(X) )    },
         swp ={swX <- sweep(X, 2, colMeans(X), FUN='-')},
         scl={sX <- scale(X, scale=FALSE)}, 
     replications=10^2,
     order=c('replications', 'elapsed'))
#    
  test replications elapsed relative user.self sys.self user.child sys.child
2  mat          100   2.075 1.000000     1.262    0.820          0         0
3  swp          100   2.964 1.428434     1.917    1.058          0         0
4  scl          100   2.981 1.436627     1.935    1.059          0         0
1 frlp          100   3.651 1.759518     2.540    1.128          0         0
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Actually, it seems that both sweep and scale are about twice as slow as my for loop. – Zach Sep 08 '12 at 16:57
  • I edited my original post. Thanks for the benchmark code. However, it seems that the for loop is actually fastest on larger matrices (5,000 rows, 10,000 cols, or 50,000 rows and 10,000 cols). – Zach Sep 08 '12 at 17:21
3

Perhaps compiling your frlp() function would speed things a bit?

frlp.c <- compiler:::cmpfun(function(mat){
              means <- colMeans(mat)
              for (i in 1:length(means)){
                mat[,i] <- mat[,i]-means[i] 
              }
              mat
            }
          )

[Edit]: for me it didn't speed things up, but I had to scale down X greatly to work on my computer. It may scale well, don't know

You may also wish to compare with JIT:

frlp.JIT <- function(mat){
              means <- colMeans(mat)
              compiler::enableJIT(2)
              for (i in 1:length(means)){
                mat[,i] <- mat[,i]-means[i] 
              }
              mat
            }
tim riffe
  • 5,651
  • 1
  • 26
  • 40
1

Here are a few more, none as fast as Josh's:

X <- matrix(runif(1e6), ncol = 1000)
matmult    <- function(mat) mat - rep(1, nrow(mat)) %*% t(colMeans(mat))
contender1 <- function(mat) mat - colMeans(mat)[col(mat)]
contender2 <- function(mat) t(apply(mat, 1, `-`, colMeans(mat)))
contender3 <- function(mat) mat - rep(colMeans(mat), each = nrow(mat))
contender4 <- function(mat) mat - matrix(colMeans(mat), nrow(mat), ncol(mat),
                                         byrow = TRUE)
benchmark(matmult(X),
          contender1(X),
          contender2(X),
          contender3(X),
          contender4(X),
          replications = 100,
          order=c('replications', 'elapsed'))
#       test replications elapsed relative user.self sys.self
# 1    matmult(X)          100    1.41 1.000000      1.39     0.00
# 5 contender4(X)          100    1.90 1.347518      1.90     0.00
# 4 contender3(X)          100    2.69 1.907801      2.69     0.00
# 2 contender1(X)          100    2.74 1.943262      2.73     0.00
# 3 contender2(X)          100    6.30 4.468085      6.26     0.03

Note that I am testing on a matrix of numerics, not integers; I think more people will find that useful (if it makes any difference.)

flodel
  • 87,577
  • 21
  • 185
  • 223