7

I have an integer vector a

a=function(l) as.integer(runif(l,1,600))
a(100)
  [1] 414 476   6  58  74  76  45 359 482 340 103 575 494 323  74 347 157 503 385 518 547 192 149 222 152  67 497 588 388 140 457 429 353
 [34] 484  91 310 394 122 302 158 405  43 300 439 173 375 218 357  98 196 260 588 499 230  22 369  36 291 221 358 296 206  96 439 423 281
 [67] 581 127 178 330 403  91 297 341 280 164 442 114 234  36 257 307 320 307 222  53 327 394 467 480 323  97 109 564 258   2 355 253 596
[100] 215

and an integer matrix B

B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
B(10)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]  250  411  181  345    4  519  167  395  130   388
[2,]  383  377  555  304  119  317  586  351  136   528
[3,]  238  262  513  476  579  145  461  191  262   302
[4,]  428  467  217  590   50  171  450  189  140   158
[5,]  178   14   31  148  285  365  515   64  166   584

and I would like to make a new boolean l x c matrix that shows whether or not each vector element in a is present in each specific column of matrix B.

I tried this with

ispresent1 = function (a,B) { 
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }

or with

ispresent2 = function (a,B) t(sapply(1:length(a), function(i) apply(B,2,function(x) a[[i]] %in% x)))

but neither of these ways to do this are very fast:

a1=a(1000)
B1=B(20000)
system.time(ispresent1(a1,B1))
   user  system elapsed 
  76.63    1.08   77.84 

system.time(ispresent2(a1,B1))
   user  system elapsed 
  218.10    0.00   230.00 

(in my application matrix B would have about 500 000 - 2 million columns)

Probably this is something trivial, but what is the proper way to do this?

EDIT: proper syntax, as mentioned below, is ispresent = function (a,B) apply(B,2,function(x) { a %in% x } ), but the Rcpp solution below is still almost 2 times faster! Thanks for this!

Tom Wenseleers
  • 7,535
  • 7
  • 63
  • 103
  • You mean present in *any* of the column of `B` matrix? Your wording is confusing me while looking at your code – David Arenburg Aug 31 '15 at 08:42
  • No I mean presence of each vector element in each of the columns - the "any" in the code is used to work on the 3 dimensional array out produced by outer. – Tom Wenseleers Aug 31 '15 at 08:46
  • To rephrase the requirement: for each element of each column of B2, set TRUE is the value is in a1, FALSE otherwise ? (am I right or did I totally missed the point ?) – Tensibai Aug 31 '15 at 09:34
  • No it's the other way around - for each element of vector a1 I want to know if it occurs among the elements of each of the columns of B1 - I think your apply is working the other way around... – Tom Wenseleers Aug 31 '15 at 09:39
  • The result should therefore be an l x c (here 1000 x 20000) boolean matrix - your apply would return a 5 x 20 000 matrix, which is not what I'm after... Sorry if I wasn't clear somewhere... – Tom Wenseleers Aug 31 '15 at 09:42
  • 3
    @TomWenseleers So in this case: `apply(B1,2,function(x) { a1 %in% x } )` should do. (near 40 times faster than `ispresent1`on my machine) (and results verified with `identical`) – Tensibai Aug 31 '15 at 09:46
  • @Tensibai That's neat, you should post that. – David Arenburg Aug 31 '15 at 09:54
  • 2
    Ha yes of course - many thanks for that - I knew there had to be some simple solution - was clearly making things a little overcomplicated! Would you mind posting it as an answer? – Tom Wenseleers Aug 31 '15 at 09:59

3 Answers3

10

After digging a little, and by curiosity about the Rcpp answer of @Backlin I did write a benchmark of orignal solution and our two solutions:

I had to change a little Backlin's function as inline didn't work on my windows box (sorry if I missed something with it, let me know if there's something to adapt)

Code used:

set.seed(123) # Fix the generator
a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)

ispresent1 = function (a,B) { 
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }

a1=a(1000)
B1=B(20000)

tensibai <- function(v,m) {
  apply(m,2,function(x) { v %in% x })
}

library(Rcpp)
cppFunction("LogicalMatrix backlin(IntegerVector a,IntegerMatrix B) {
    IntegerVector av(a);
    IntegerMatrix Bm(B);
    int i,j,k;
    LogicalMatrix out(av.size(), Bm.ncol());
    for(i = 0; i < av.size(); i++){
        for(j = 0; j < Bm.ncol(); j++){
            for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
            if(k < Bm.nrow()) out(i, j) = true;
        }
    }
    return(out);
}")

Validation:

> identical(ispresent1(a1,B1),tensibai(a1,B1))
[1] TRUE
> identical(ispresent1(a1,B1),backlin(a1,B1))
[1] TRUE

Benchmark:

> library(microbenchmark)
> microbenchmark(ispresent1(a1,B1),tensibai(a1,B1),backlin(a1,B1),times=3)

Unit: milliseconds
               expr        min         lq       mean     median         uq        max neval
 ispresent1(a1, B1) 36358.4633 36683.0932 37312.0568 37007.7231 37788.8536 38569.9840     3
   tensibai(a1, B1)   603.6323   645.7884   802.0970   687.9445   901.3294  1114.7144     3
    backlin(a1, B1)   471.5052   506.2873   528.3476   541.0694   556.7689   572.4684     3

Backlin's solution is slightly faster, proving again Rcpp is a good choice if you know cpp at first :)

Tensibai
  • 15,557
  • 1
  • 37
  • 57
  • You're welcome. @Backlin could check I didn't penalize your function while migrating to `cppFunction` ? – Tensibai Aug 31 '15 at 10:22
  • 1
    Nice @Tensaibai! I'm so into `Rcpp` at the moment that I sometimes forget how good base R is. Unless the data set is huge or you need to run the function a million times your solution would be the sensible choice since it saves you precious development time and is easier to read too. – Backlin Aug 31 '15 at 10:24
  • There isn't any difference in time between your C++ code and mine (yours was in fact 10 milliseconds faster in my benchmarking). – Backlin Aug 31 '15 at 10:27
  • @Backlin I don't get why, I only changed the declaration in fact. I assume 10 ms difference is more due to external factors on the benchmark machine (other processes using the cpu with higher priority for example) – Tensibai Aug 31 '15 at 10:29
  • Just a doubt, wouldn't this be possible within `outer` using `Vectorize` (not tested) – akrun Sep 01 '15 at 11:07
8

Rcpp is awesome for problems like this. It is quite possible that there is some way to do it with data.table or with an existing function, but with the inline package it takes almost less time to write it yourself than to find out.

require(inline)

ispresent.cpp <- cxxfunction(signature(a="integer", B="integer"),
                             plugin="Rcpp", body='
    IntegerVector av(a);
    IntegerMatrix Bm(B);
    int i,j,k;
    LogicalMatrix out(av.size(), Bm.ncol());
    for(i = 0; i < av.size(); i++){
        for(j = 0; j < Bm.ncol(); j++){
            for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
            if(k < Bm.nrow()) out(i, j) = true;
        }
    }
    return(out);
')

set.seed(123)
a1 <- a(1000)
B1 <- B(20000)
system.time(res.cpp <- ispresent.cpp(a1, B1))
   user  system elapsed 
  0.442   0.005   0.446
res1 <- ispresent1(a1,B1)
identical(res1, res.cpp)
[1] TRUE
Backlin
  • 14,612
  • 2
  • 49
  • 81
  • Thanks a lot for this! Almost two times faster than the solution mentioned above ispresent = function (a,B) apply(B,2,function(x) { a %in% x } ) ! Great! Thx millions! Will be my chance to finally get into Rcpp! – Tom Wenseleers Aug 31 '15 at 10:04
-1
a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)

ispresent1 = function (a,B) { 
  out = outer(a, B, FUN = "==" )
  apply(out,c(1,3),FUN="any") }

ispresent2 = function (a,B) t(sapply(1:length(a), function(i) apply(B,2,function(x) a[[i]] %in% x)))

ispresent3<-function(a,B){
  tf<-matrix((B %in% a),nrow=5)
  sapply(1:ncol(tf),function(x) a %in% B[,x][tf[,x]])
}

a1=a(1000)
B1=B(20000)

> system.time(ispresent1(a1,B1))
   user  system elapsed 
  29.91    0.48   30.44 

> system.time(ispresent2(a1,B1))
   user  system elapsed 
  89.65    0.15   89.83 

> system.time(ispresent3(a1,B1))
   user  system elapsed 
   0.83    0.00    0.86 

res1<-ispresent1(a1,B1)
res3<-ispresent3(a1,B1)

> identical(res1,res3)
[1] TRUE
vck
  • 827
  • 5
  • 10
  • Thx for this - don't think this is correct though, as my output matrix ispresent should be an l x c (here 1000 x 20000) matrix... – Tom Wenseleers Aug 31 '15 at 08:58
  • `B<-unique(B)` this was unnecessary. But ok i understand you. All a1 values must be unique in your study, i guess. – vck Aug 31 '15 at 09:02
  • Yes that's correct - in my actual data there would not be many (almost no) duplicate elements... Guess I should have stated that... – Tom Wenseleers Aug 31 '15 at 09:08