
I'm new to R (and to Stack Overflow) and I would appreciate your help. I would like to count the number of occurrences of each unique column in a matrix. I have written the following code, but it is extremely slow:

frequencyofequalcolumnsinmatrix = function(matrixM){
  # Returns a matrix containing each distinct column of matrixM, with the
  # frequency of that column in the last row. Hence if the last row is
  # c(3,5,3,2), then matrixM has 3+5+3+2=13 columns, 4 of them distinct;
  # the first distinct column appears 3 times, the second 5 times, etc.
  n = nrow(matrixM)
  columnswithfrequencyofmtxM = c()
  while (ncol(matrixM) > 0){
    # indices of the columns equal to the first column
    # (all(x == 0) also works for integer matrices, unlike identical(x, rep(0,n)))
    indexzero = which(apply(matrixM - matrixM[,1], 2, function(x) all(x == 0)))
    indexnotzero = setdiff(seq_len(ncol(matrixM)), indexzero)
    # vector of length n+1: entries 1 to n hold the distinct column,
    # entry n+1 holds the number of times that column appears
    frequencyofgivencolumn = c(matrixM[,1], length(indexzero))
    columnswithfrequencyofmtxM = cbind(columnswithfrequencyofmtxM, frequencyofgivencolumn, deparse.level=0)
    # drop=FALSE keeps matrixM a matrix whatever its remaining shape
    matrixM = matrixM[, indexnotzero, drop=FALSE]
  }
  return(columnswithfrequencyofmtxM)
}

If we apply on the matrix 'testmtx', we obtain:

> testmtx = matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
> frequencyofequalcolumnsinmatrix(testmtx)
     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    2    1    2
[3,]    4    1    1
[4,]    2    3    1

where the last row contains the number of occurrences of the column above.

Unhappy with my code, I browsed through Stack Overflow and found the following question:

Fastest way to count occurrences of each unique element

It shows that the fastest way to count the occurrences of each unique element of a vector is to use the data.table package. Here is the code:

library(data.table)

f6 <- function(x){
  data.table(x)[, .N, keyby = x]
}

When we run it we obtain:

> vtr = c(1,2,3,1,1,2,4,2,4)
> f6(vtr)
   x N
1: 1 3
2: 2 3
3: 3 1
4: 4 2

I have tried to modify this code to use it in my case. This requires creating vtr as a vector in which each element is itself a vector, but I haven't been able to do that (most likely because in R, c(c(1,2),c(3,4)) is the same as c(1,2,3,4)).
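For what it's worth, a list is the usual R container for "a vector whose elements are vectors". The sketch below only illustrates that point using base R; `cols` and `keys` are names invented here for illustration, and turning each column into one string key is just one possible way to get back to a plain vector that a counting function can take.

```r
# c() flattens nested vectors, list() keeps them as separate elements
flat   <- c(c(1, 2), c(3, 4))
nested <- list(c(1, 2), c(3, 4))
length(flat)    # 4
length(nested)  # 2

# Split a matrix into a list with one element per column
testmtx <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
cols <- lapply(seq_len(ncol(testmtx)), function(j) testmtx[, j])

# One string key per column: a plain character vector again
keys <- vapply(cols, paste, character(1), collapse = " ")
```

`keys` has one entry per column of `testmtx`, so any vector-based counter can be applied to it.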

Should I try to modify the function f6? If so, how?
Or should I take a completely different approach? If so, which one?

Thank you!

Gaël Giordano

5 Answers


One simple way would be to paste the values of each column together into a single string, giving a character vector with one element per column, and then use the function on that vector.

mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)

vec <- apply(mat, 2, paste, collapse=" ")

f6(vec)
       x N
1: 0 1 1 3
2: 1 2 1 1
3: 1 2 4 2
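The separator passed to `collapse` matters once any value has more than one digit. A minimal sketch of the problem, using a made-up 2 x 2 matrix `m` that is not part of the original data:

```r
# Two different columns: (1, 11) and (11, 1)
m <- matrix(c(1, 11,
              11, 1), nrow = 2)

# Without a separator both columns collapse to the same key
no_sep <- apply(m, 2, paste, collapse = "")     # "111" "111"

# With a separator the keys stay distinct
with_sep <- apply(m, 2, paste, collapse = " ")  # "1 11" "11 1"
```

With `collapse = ""` the two distinct columns would be counted as one, which is why a separator is the safer choice.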

EDIT

The answer by @RohitDas made me think: when it comes to performance, it is always best to check. I took all the functions shown in the question the OP linked and added:

f7 <- table

I also added the f10 suggestion by @DavidArenburg:

f10 <- function(x){ 
  table(unlist(data.table(x)[, lapply(.SD, paste, collapse = "")])) 
}

Here are the results:

After adding the solution by @MaratTalipov (f11), it is the clear winner on this 1000 x 3 test matrix. Applied directly to the matrix, it is faster than all the paste-based vector solutions.

set.seed(1)
testmx <- matrix(sample(1:10, 3 * 1e3, replace=TRUE), nrow=1000)

microbenchmark(
   f1(apply(testmx, 2, paste, collapse=" ")),
   f2(apply(testmx, 2, paste, collapse=" ")),
   f3(apply(testmx, 2, paste, collapse=" ")),
   f4(apply(testmx, 2, paste, collapse=" ")),
   f5(apply(testmx, 2, paste, collapse=" ")),
   f6(apply(testmx, 2, paste, collapse=" ")),
   f7(apply(testmx, 2, paste, collapse=" ")),
   f8(apply(testmx, 2, paste, collapse=" ")),
   f9(apply(testmx, 2, paste, collapse=" ")),
   f10(testmx),
   f11(testmx),
   f12(testmx)
   )
Unit: microseconds
                                       expr      min        lq      mean   median        uq       max neval
 f1(apply(testmx, 2, paste, collapse = " ")) 3311.770 3511.5620 3901.0020 3612.035 3849.3600  9569.987   100
 f2(apply(testmx, 2, paste, collapse = " ")) 3044.997 3263.6515 3667.9232 3430.914 3847.2430  6721.318   100
 f3(apply(testmx, 2, paste, collapse = " ")) 2032.179 2118.0245 2371.8638 2213.301 2430.4155  6631.624   100
 f4(apply(testmx, 2, paste, collapse = " ")) 2119.949 2218.3050 2497.1513 2286.442 2425.0260  6258.987   100
 f5(apply(testmx, 2, paste, collapse = " ")) 2131.498 2221.5775 2459.9300 2309.925 2530.3115  4222.575   100
 f6(apply(testmx, 2, paste, collapse = " ")) 3121.217 3367.7815 3738.3239 3486.155 3835.1175  7979.352   100
 f7(apply(testmx, 2, paste, collapse = " ")) 1766.175 1832.9650 2040.5483 1889.169 2032.1795  3784.110   100
 f8(apply(testmx, 2, paste, collapse = " ")) 2085.303 2169.2240 2435.6932 2237.168 2404.2380  5002.109   100
 f9(apply(testmx, 2, paste, collapse = " ")) 2802.090 2988.0230 3449.0685 3056.930 3373.1710 17640.957   100
                                f10(testmx) 4027.017 4251.6385 4865.7036 4399.461 4848.7035 11811.581   100
                                f11(testmx)  500.058  549.1395  624.9526  576.279  636.1395  1176.809   100
                                f12(testmx) 1827.769 1886.4740 1957.0555 1902.834 1964.4270  3600.487   100
cdeterman
  • What are these functions? – Marat Talipov Feb 12 '15 at 20:09
  • @MaratTalipov, I stated in my edit that the functions are in the link the OP provided above. I will add the link to be clear or would users prefer I write all the functions out again? – cdeterman Feb 12 '15 at 20:16
  • How about adding `f10 <- function(x){ table(unlist(data.table(testmtx)[, lapply(.SD, paste, collapse = "")])) } ; f10(testmtx)` to the benchmark? Btw, it is meaningless to benchmark on a vector, because this isn't what the OP wants. – David Arenburg Feb 12 '15 at 20:23
  • @DavidArenburg, `tabulate` still seems to win unless I somehow mistook your example (where I needed to change `testmx` to `x`). I also benchmark on a vector because it is part of the solution I provided. – cdeterman Feb 12 '15 at 20:29
  • I've just benchmarked on a matrix and your approach seems to be the clear winner. The reason for this is probably because we are talking matrices here instead of `data.frames` – David Arenburg Feb 12 '15 at 20:43
  • @DavidArenburg, @cdeterman, could you please also benchmark `f11()` from my answer? – Marat Talipov Feb 12 '15 at 20:59
  • I added some benchmarks with a larger data set. Looks like `f6` wins. – BrodieG Feb 12 '15 at 21:50
  • Also, you should use `collapse=" "` to avoid frame shift problems if there are numbers > 9 – BrodieG Feb 12 '15 at 21:58

This should be somewhat efficient. The first step is to use duplicated to figure out which columns to count; the second is to use vector recycling and colSums to count the instances of each of those columns.

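f12 <- function(testmx) {
  # TRUE for the first occurrence of each distinct column
  singles <- !duplicated(testmx, MARGIN=2)
  # drop=FALSE keeps a matrix even if only one distinct column exists
  uniq <- testmx[, singles, drop=FALSE]
  rbind(
    uniq,
    # x is recycled down each column of testmx; an all-zero column is a match
    apply(uniq, 2, function(x) sum(colSums(abs(testmx - x)) == 0))
  )
}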

Produces:

     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    2    1    2
[3,]    4    1    1
[4,]    2    3    1
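To make the two steps concrete, here is a small walk-through on the 3 x 6 example matrix from the question. It is only an illustration of the ideas used in f12; the intermediate names `dup`, `x`, and `matches` are introduced here.

```r
testmx <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)

# Step 1: flag columns that repeat an earlier column
dup <- as.vector(duplicated(testmx, MARGIN = 2))
# FALSE FALSE FALSE  TRUE  TRUE  TRUE

# Step 2: subtracting a vector of length nrow(testmx) recycles it down
# every column, so an all-zero column in the difference marks a match
x <- testmx[, 1]
matches <- colSums(abs(testmx - x)) == 0
#  TRUE FALSE FALSE  TRUE FALSE FALSE

sum(matches)  # 2: the first column appears twice
```

The recycling only lines up because R stores matrices column by column and `x` has exactly one entry per row.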

This appears to be much faster than f11 from Marat, but f6 + apply seems to take the cake:

set.seed(1)
testmx <- matrix(sample(1:10, 3 * 1e3, replace=TRUE), nrow=3)

library(microbenchmark)
microbenchmark(
  f12(testmx), 
  f11(testmx), 
  f6(apply(testmx, 2, paste, collapse="")), times=10
)

Unit: milliseconds
                                       expr         min          lq       mean
                                f12(testmx)   36.576060   36.931514   38.18358
                                f11(testmx) 2095.305540 2122.316487 2145.72614
 f6(apply(testmx, 2, paste, collapse = ""))    7.570614    7.601697    8.78227
BrodieG
  • Yeah, `f11` is terrible here. But try `nrow=1000` :) – Marat Talipov Feb 12 '15 at 22:11
  • @MaratTalipov, good point. Looks like on square matrices `f12` still does better (100 x 100), but on tall matrices (1000 x 100), `f11` pulls even. In either case `f6` wins hands down. – BrodieG Feb 12 '15 at 22:18

"Brute force" approach:

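f11 <- function(testmtx) {
  nc <- ncol(testmtx)
  z <- seq_len(nc)                 # group label for each column
  for (i in seq_len(nc-1)) {       # seq_len() avoids looping when nc is 1
    # compare column i with every later column
    dup <- sapply(seq(i+1,nc),function(j) identical(testmtx[,i],testmtx[,j]))
    # later duplicates inherit the label of column i
    z[which(dup)+i] <- z[i]
  }
  table(z)
}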

It should have complexity O(N^2*M), where N and M are the numbers of columns and rows, respectively. The other solution, based on paste, has complexity O(N*M^2), so their relative performance should be quite sensitive to N/M.

[EDIT] Actually, I am not sure about the complexity of the paste-based solution -- it could easily be O(N^2*M^2)...
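Since the complexity estimates are uncertain, measuring is probably the simplest check. The sketch below is one rough way to probe the sensitivity to N/M with base R only; `count_pairwise` and `count_paste` are names made up here (the first follows the same idea as f11, the second is a plain paste + table counter), and the matrix sizes are arbitrary.

```r
# Pairwise-comparison counter (same idea as f11)
count_pairwise <- function(m) {
  nc <- ncol(m)
  z <- seq_len(nc)
  for (i in seq_len(nc - 1)) {
    dup <- sapply(seq(i + 1, nc), function(j) identical(m[, i], m[, j]))
    z[which(dup) + i] <- z[i]
  }
  table(z)
}

# Paste-based counter
count_paste <- function(m) table(apply(m, 2, paste, collapse = " "))

set.seed(1)
wide <- matrix(sample(1:3, 3 * 500, replace = TRUE), nrow = 3)    # few rows, many columns
tall <- matrix(sample(1:3, 500 * 3, replace = TRUE), nrow = 500)  # many rows, few columns

# Timings depend on the machine, so no numbers are claimed here
system.time(count_pairwise(wide)); system.time(count_paste(wide))
system.time(count_pairwise(tall)); system.time(count_paste(tall))
```

Both counters should report the same group sizes; only the running time is expected to differ between the wide and tall cases.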

[EDIT2] A slightly more efficient alternative to f11(), which uses @BrodieG's way of comparing one matrix column against the rest of the matrix:

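f13 <- function(testmtx) {
  nc <- ncol(testmtx)
  z <- seq_len(nc)                 # group label for each column
  for (i in seq_len(nc-1)) {       # seq_len() avoids looping when nc is 1
    # compare column i against all later columns at once
    dup <- colSums(abs(testmtx[,seq(i+1,nc),drop=FALSE] - testmtx[,i])) == 0
    z[which(dup)+i] <- z[i]
  }
  table(z)
}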
Marat Talipov
  • I think a potentially better way to test for duplication is with `duplicated`. It has a method for arrays/matrices (see my answer). – BrodieG Feb 12 '15 at 21:44

Here's f6prime for you:

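f6prime = function(mat) {
  # each column of mat becomes a row of the data.table
  dt = as.data.table(t(mat))
  # group by all columns and count the rows in each group
  dt[, .N, by = names(dt)]
}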

f6prime(mat)
#   V1 V2 V3 N
#1:  1  2  4 2
#2:  0  1  1 3
#3:  1  2  1 1
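If the result is needed in the layout from the question (distinct columns on top, frequencies in the last row), one possible follow-up is to transpose the counts back into a matrix. This is only a sketch built on the same grouping idea as f6prime; `dt`, `counts`, and `res` are names chosen here, and it assumes the data.table package is installed.

```r
library(data.table)

mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)

# Each column of mat becomes a row; grouping by every column (V1, V2, ...)
# then counts identical rows, i.e. identical columns of mat
dt <- as.data.table(t(mat))
counts <- dt[, .N, by = names(dt)]

# Transpose back: distinct columns on top, frequencies in the last row
res <- unname(t(as.matrix(counts)))
res
```

Groups come out in order of first appearance, so `res` lists the distinct columns in the order they occur in `mat`.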
eddi
  • I admit it is the fastest solution for a small number of rows (I tried 10 rows, 1000 columns), but it is ~20 times slower than `f6` for the case of 1000 rows and 10 columns, which in turn is ~5 times slower than `f13`. – Marat Talipov Feb 12 '15 at 22:38
  • I should have specified that my data will be in the form of a matrix with very few rows (from 2 to 50) but 4000 columns. Hence _f6prime_ is perfect. Thank you for your help. – Gaël Giordano Feb 13 '15 at 05:16
  • @GaëlGiordano sure thing; @MaratTalipov large number of rows and small number of columns would be a very strange setup for this problem, as generally one would expect `N` to just be 1 for every column in that case. – eddi Feb 13 '15 at 15:11

Borrowing from @cdeterman's solution: once you have the vector of pasted column values, you can simply call table on it to get the counts.

table(vec)
vec
0 1 1 1 2 1 1 2 4 
    3     1     2 
Rohit Das