7

My goal is to "sum" two not compatible matrices (matrices with different dimensions) using (and preserving) row and column names.

I've figured this approach: convert the matrices to data.table objects, join them and then sum columns vectors.

An example:

> M1
  1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
> M2
  1 3 4 5 8
1 0 0 1 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
8 0 0 0 0 0
> M1 %ms% M2
  1 3 4 5 7 8
1 0 0 2 0 0 0
3 0 0 0 0 0 0
4 2 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0

This is my code:

M1 <- matrix(c(0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0), byrow = TRUE, ncol = 6)
colnames(M1) <- c(1,3,4,5,7,8)
M2 <- matrix(c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), byrow = TRUE, ncol = 5)
colnames(M2) <- c(1,3,4,5,8)
# to data.table objects
DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn")
DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn")
# join and sum of common columns
if (nrow(DT1) > nrow(DT2)) {
    A <- DT2[DT1, roll = TRUE]
    A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn]
}

That outputs:

   rn X1 X3 X4 X5 X7 X8
1:  1  0  0  2  0  0  0
2:  3  0  0  0  0  0  0
3:  4  2  0  0  0  0  0
4:  5  0  0  0  0  0  0
5:  7  0  0  0  0  1  0
6:  8  0  0  0  0  0  0

Then I can convert back this data.table to a matrix and fix row and column names.

The questions are:

  • how to generalize this procedure?

    I need a way to automatically create list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1) because i want to apply this function to matrices which dimensions (and row/columns names) are not known in advance.

    In summary I need a merge procedure that behaves as described.

  • there are other strategies/implementations that achieve the same goal that are, at the same time, faster and generalized? (hoping that some data.table monster help me)

  • to what kind of join (inner, outer, etc. etc.) is assimilable this procedure?

Thanks in advance.

p.s.: I'm using data.table version 1.8.2


EDIT - SOLUTIONS

@Aaron solution. No external libraries, only base R. It works also on list of matrices.

add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows,cols))
  for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m
  out
}

@MadScone solution. Use reshape2 package. It works only on two matrices per call.

add_matrices_2 <- function(m1, m2) {
  m <- acast(rbind(melt(M1), melt(M2)), Var1~Var2, fun.aggregate = sum)
  mn <- unique(colnames(m1), colnames(m2))
  rownames(m) <- mn
  colnames(m) <- mn
  m
}

@Aaron solution. Use Matrix package. It work only on sparse matrices, also on list of them.

add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(
      i         = ifelse(ilj, i, j),
      j         = ifelse(ilj, j, i),
      x         = s$x,
      dims      = c(nrows, ncols),
      dimnames  = list(rows, cols),
      symmetric = TRUE
    )
  })
  Reduce(`+`, newms)
}

BENCHMARK (100 runs with microbenchmark package)

Unit: microseconds
   expr                min         lq    median         uq       max
1 add_matrices_1   196.009   257.5865   282.027   291.2735   549.397
2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448

No need to comment the benchmark: @Aaron solution wins.

Details

For insights about performances (that depend of the size and the sparsity of the matrices) see @Aaron's edit (and the solution for sparse matrices: add_matrices_3).

leodido
  • 912
  • 7
  • 21

3 Answers3

6

I'd just line up the names and go to town with base R.

Here's a simple function that takes an unspecified number of matrices and adds them up by their row/column names.

add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim=c(length(rows), length(cols)), dimnames=list(rows,cols))
  for(M in a) { out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M }
  out
}

It then works like this:

# giving them rownames and colnames
colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)

add_matrices_1(M1, M2)
#   1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0

For bigger matrices, however, it doesn't do as well. Here's a function to make a matrix, choosing n columns out of N possibilities, and filling k spots with non-zero values. (This assumes symmetrical matrices.)

makeM <- function(N, n, k) {
  s1 <- sample(N, n)
  M1 <- array(0, dim=c(n,n), dimnames=list(s1, s1))
  r1 <- sample(n,k, replace=TRUE)
  c1 <- sample(n,k, replace=TRUE)
  M1[cbind(c(r1,c1), c(c1,r1))] <- sample(N,k)
  M1
}

Then here's another version that uses sparse matrices.

add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i<j
    sparseMatrix(i=ifelse(ilj, i, j),
                 j=ifelse(ilj, j, i),
                 x=s$x,
                 dims=c(nrows, ncols),
                 dimnames=list(rows, cols), symmetric=TRUE)
  })
  Reduce(`+`, newms)
}

This version is definitely faster when the matrices are large and sparse. (Note that I'm not timing the conversion to a sparse symmetric matrix, as hopefully if that's a suitable format, you'll use that format throughout your code.)

set.seed(50)
M1 <- makeM(10000, 5000, 50)
M2 <- makeM(10000, 5000, 50)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
system.time(add_matrices_1(M1, M2))
#   user  system elapsed 
#  2.987   0.841   4.133 
system.time(add_matrices_3(mm1, mm2))
#   user  system elapsed 
#  0.042   0.012   0.504 

But when the matrices are small, my first solution is still faster.

set.seed(50)
M1 <- makeM(100, 50, 20)
M2 <- makeM(100, 50, 20)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
microbenchmark(add_matrices_1(M1, M2), add_matrices_3(mm1, mm2))
# Unit: microseconds
#                       expr      min       lq   median        uq       max
# 1   add_matrices_1(M1, M2)  398.495  406.543  423.825  544.0905  43077.27
# 2 add_matrices_3(mm1, mm2) 5734.623 5937.473 6044.007 6286.6675 509584.24

Moral of the story: Size and sparsity matter.

Also, getting it right is more important than saving a few microseconds. It's almost always best to use simple functions and don't worry about speed unless you run into trouble. So in small cases, I'd prefer MadScone's solution, as it's easy to code and simple to understand. When that gets slow, I'd write a function like my first attempt. When that gets slow, I'd write a function like my second attempt.

Aaron left Stack Overflow
  • 36,704
  • 7
  • 77
  • 142
  • **Brilliant solution**. Thanks! Some suggest to implement the same behaviour with `data.table` objects ? – leodido Nov 27 '12 at 01:04
3

Here is a data.table solution. The magic is to add the .SD components (which have identical names in both) then assign the remaining column by reference.

# a function to quickly get the non key columns
nonkey <- function(DT){ setdiff(names(DT),key(DT))}
# the columns in DT1 only
notinR <- setdiff(nonkey(DT1), nonkey(DT2))

#calculate; .. means "up one level"
result <- DT2[DT1, .SD + .SD, roll= TRUE][,notinR := unclass(DT1[, ..notinR])]

# re set the column order to the original (DT1) order
setcolorder(result, names(DT1))

# voila!
result

   rn 1 3 4 5 7 8
1:  1 0 0 2 0 0 0
2:  3 0 0 0 0 0 0
3:  4 2 0 0 0 0 0
4:  5 0 0 0 0 0 0
5:  7 0 0 0 0 1 0
6:  8 0 0 0 0 0 0

I'm not convinced this is a particularly stable solution, given that I'm not sure it isn't fluking the answer because M1 and M2 are subsets of eachother


Edit, an ugly approach using eval

This is made harder because you have non-syntatic names (`1` etc)

inBoth <- intersect(nonkey(DT1), nonKey(DT2))

 backquote <- function(x){paste0('`', x, '`')}
 bqBoth <- backquote(inBoth)

 charexp <- sprintf('list(%s)',paste(c(paste0( bqBoth,'=',  bqBoth, '+ i.',inBoth), backquote(notinR)), collapse = ','))

result2 <- DT2[DT1,eval(parse(text = charexp)), roll = TRUE]
 setcolorder(result2, names(DT1))

# voila!
result2


   rn 1 3 4 5 7 8
1:  1 0 0 2 0 0 0
2:  3 0 0 0 0 0 0
3:  4 2 0 0 0 0 0
4:  5 0 0 0 0 0 0
5:  7 0 0 0 0 1 0
6:  8 0 0 0 0 0 0
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
mnel
  • 113,303
  • 27
  • 265
  • 254
1

I think I managed to do it with this single disgusting line:

cast(aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum), X1 ~ X2)[,-1]

This makes use of the reshape package. Returned as a data frame so convert to matrix as necessary.

If you want it in the format you suggested in your example, try this:

"%ms%" <- function(m1, m2) {
  m <- as.matrix(cast(aggregate(value ~ X1 + X2, rbind(melt(m1), melt(m2)), sum), X1 ~ X2)[,-1])
  mn <- unique(colnames(m1), colnames(m2))
  rownames(m) <- mn
  colnames(m) <- mn
  return (m)
}

Then you can do:

M1 %ms% M2


EDIT:

EXPLANATION

Obviously should have some explanation sorry.

melt(M1)

Converts M1 from its original form into a format like this (row, col, value). E.g.

    1 3 4 5 7 8
  1 0 0 1 0 0 0
  3 0 0 0 0 0 0
  4 1 0 0 0 0 0
  5 0 0 0 0 0 0
  7 0 0 0 0 1 0
  8 0 0 0 0 0 0

Is converted to:

  X1 X2 value 
1  1  1     0
2  3  1     0
3  4  1     1

etc. Combining M1 and M2 lists every possible (row, col, value) across both matrix into one single matrix. Now this:

aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum)

Sums values where the row and column are the same. So it will sum (1, 1) across both matrices for example. And (3, 1) etc. It won't do anything that doesn't exist e.g. M2 doesn't have a 7th column/row.

Finally cast transforms the matrix so that it is written with the result of aggregate's first column as rows, its second column as columns. Effectively undoing the melt from earlier. The [,-1] is taking off an unnecessary column leftover from cast (I think there is probably a better way of doing that but I don't know how).

As I said, it's returned as a data frame so use as.matrix() on the result if that's what you wish.

Ciarán Tobin
  • 7,306
  • 1
  • 29
  • 45
  • 1
    Oh, that's nice. I think you could have the `cast` function do the aggregation, so the `aggregate` function would be unneeded. Using the `reshape2` package so that we can choose to get a matrix back instead of a data frame, it would be `acast(rbind(melt(M1), melt(M2)), Var1~Var2, fun.aggregate=sum)`. – Aaron left Stack Overflow Nov 27 '12 at 02:41
  • It is a good solution: thanks! However, I've done a little benchmark and reported results in the question. – leodido Nov 27 '12 at 04:53
  • Good stuff @Aaron. I'm still stuck using `reshape` for some reason. Time to upgrade methinks. – Ciarán Tobin Nov 27 '12 at 19:03