
I am trying to go from a matrix that has columns that "belong together" to one where the row-sums of the relevant sub-matrices have been formed. I.e. going from

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
[1,]    1    5    9   13   17   21   25   29   33    37    41    45    49    53    57    61
[2,]    2    6   10   14   18   22   26   30   34    38    42    46    50    54    58    62
[3,]    3    7   11   15   19   23   27   31   35    39    43    47    51    55    59    63
[4,]    4    8   12   16   20   24   28   32   36    40    44    48    52    56    60    64

to

     [,1] [,2] [,3] [,4] [,5]
[1,]   15   30   46  185  220
[2,]   18   32   48  190  224
[3,]   21   34   50  195  228
[4,]   24   36   52  200  232

I assume there must be a much more elegant and faster way to do this than looping over the indices as I do below (particularly since my real matrix would be more like 4000 rows by many thousands of columns).

example <- matrix(1:64, nrow=4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)
summed <- matrix(rep(unique(myindex), each=dim(example)[1]), nrow=dim(example)[1])
for (i in 1:length(unique(myindex))){
  summed[,i] <- apply(X=example[,(myindex==i)], MARGIN=1, FUN=sum)
}

It is probably my lack of experience with apply and tapply that prevents me from figuring this out. A fast dplyr approach would of course also be welcome.

Björn

4 Answers


We can use a one-liner with sapply:

sapply(unique(myindex), function(x) rowSums(example[, which(myindex == x), drop = FALSE]))

     [,1] [,2] [,3] [,4] [,5]
[1,]   15   30   46  185  220
[2,]   18   32   48  190  224
[3,]   21   34   50  195  228
[4,]   24   36   52  200  232

We let sapply loop over all unique values of myindex, and use which to select the columns that should be included in the rowSums.


Edit: Included drop = FALSE to prevent single-column groups from simplifying to a vector. Thanks @mt1022 for pointing out the bug!
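To see why drop = FALSE matters, consider a hypothetical index (not the one in the question, whose groups all span several columns) in which one group holds a single column. Without drop = FALSE the selection collapses to a vector, and rowSums() errors on it:

```r
example <- matrix(1:64, nrow = 4)
# Hypothetical index: group 6 holds only the last column
myindex2 <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6)

is.matrix(example[, myindex2 == 6])               # FALSE: dropped to a vector
is.matrix(example[, myindex2 == 6, drop = FALSE]) # TRUE: stays a 4 x 1 matrix

# With drop = FALSE the one-liner handles the singleton group fine:
sapply(unique(myindex2),
       function(x) rowSums(example[, myindex2 == x, drop = FALSE]))
```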

LAP
  • Thank you for the quick response! On my 4000 by 3020 example this seems consistently the fastest of the 3 proposed answers, so I will mark it as the accepted answer. – Björn Mar 28 '18 at 13:50

We can also do this by splitting:

sapply(split.default(as.data.frame(example), myindex), rowSums)
#     1  2  3   4   5
#[1,] 15 30 46 185 220
#[2,] 18 32 48 190 224
#[3,] 21 34 50 195 228
#[4,] 24 36 52 200 232
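This works because a data.frame is internally a list of columns, so split.default() partitions those columns by myindex into one smaller data.frame per group, and sapply() then applies rowSums() to each piece. A quick look at the pieces, using the same example data:

```r
example <- matrix(1:64, nrow = 4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

# One data.frame per group; the group labels become the list names
pieces <- split.default(as.data.frame(example), myindex)

sapply(pieces, ncol)      # columns per group: 3 2 2 5 4
sapply(pieces, rowSums)   # the grouped row sums, as in the answer
```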
akrun

Another approach...

example <- matrix(1:64, nrow=4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

summed <- t(apply(example,1,cumsum))
summed <- summed[,cumsum(rle(myindex)$lengths)]
summed[,-1] <- t(apply(summed,1,diff))
summed

     [,1] [,2] [,3] [,4] [,5]
[1,]   15   30   46  185  220
[2,]   18   32   48  190  224
[3,]   21   34   50  195  228
[4,]   24   36   52  200  232
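The trick relies on the groups occupying contiguous column blocks: row-wise cumulative sums evaluated at each block's last column give running totals, and differencing adjacent totals recovers the per-block sums. A quick check against a direct computation:

```r
example <- matrix(1:64, nrow = 4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

summed <- t(apply(example, 1, cumsum))            # running totals per row
summed <- summed[, cumsum(rle(myindex)$lengths)]  # keep each block's last column
summed[, -1] <- t(apply(summed, 1, diff))         # undo the accumulation

# Direct grouped row sums for comparison
direct <- sapply(unique(myindex),
                 function(x) rowSums(example[, myindex == x, drop = FALSE]))
all.equal(summed, direct)                         # TRUE
```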
Andrew Gustar

An alternative approach with matrix multiplication (less efficient for large datasets):

x <- matrix(0, nrow = ncol(example), ncol = max(myindex))
x[cbind(1:ncol(example), myindex)] <- 1
example %*% x

#      [,1] [,2] [,3] [,4] [,5]
# [1,]   15   30   46  185  220
# [2,]   18   32   48  190  224
# [3,]   21   34   50  195  228
# [4,]   24   36   52  200  232
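Much of the cost here is multiplying by zeros: the dense indicator matrix has ncol(example) * max(myindex) entries, of which only one per column of example is nonzero. As a sketch (using the Matrix package, which is an assumption on my part and not part of the original answer), a sparse indicator avoids storing and multiplying those zeros:

```r
library(Matrix)  # recommended package shipped with R; not used in the original answer

example <- matrix(1:64, nrow = 4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

# Sparse 16 x 5 indicator: a single 1 in each row, in the column of its group
x <- sparseMatrix(i = seq_along(myindex), j = myindex, x = 1)
as.matrix(example %*% x)
```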

Here is a benchmark with example data matching the real data size:

library(microbenchmark)

n_row <- 4000
n_col <- 3020
example <- matrix(rnorm(n_row * n_col), nrow = n_row)
myindex <- ceiling((1:n_col)/5)

microbenchmark(
    matrix = {
        x <- matrix(0, nrow = ncol(example), ncol = max(myindex))
        x[cbind(1:ncol(example), myindex)] <- 1
        example %*% x
    },
    split = {  # by akrun
        sapply(split.default(as.data.frame(example), myindex), rowSums)
    },
    which = {  # by LAP
        sapply(unique(myindex), function(x) rowSums(example[, which(myindex == x)]))
    },
    times = 10
)

# Unit: milliseconds
#    expr       min        lq     mean    median       uq      max neval
#  matrix 982.55727 989.65177 992.7295 992.91230 997.3704 999.0066    10
#   split 162.13377 162.57711 194.5668 167.92963 182.5335 403.8740    10
#   which  90.28227  94.82681 119.3977  96.03701 103.1125 316.9170    10
mt1022
  • Thank you, the microbenchmark package is really interesting. With my real example (4000 by 3020) I ended up with (unit: milliseconds): matrix: lq 13040.5662, mean 13535.0503, median 13874.2902, uq 14135.7251; split: lq 2379.6066, mean 3173.5876, median 2631.1031, uq 3371.2384; which: lq 204.2357, mean 322.9715, median 254.2363, uq 383.8717. – Björn Mar 28 '18 at 13:53
  • @Björn, it seems the matrix approach is too slow on the real dataset. I thought it should be faster. – mt1022 Mar 28 '18 at 13:57