6

I have several algorithms that depend on the efficiency of determining whether an element exists in a vector or not. It seems to me that %in% (which is equivalent to is.element()) should be the most efficient as it simply returns a Boolean value. After testing several methods, to my surprise, those methods are by far the most inefficient. Below is my analysis (the results get worse as the size of the vectors increases):

EfficiencyTest <- function(n, Lim) {

    samp1 <- sample(Lim, n)
    set1 <- sample(Lim, Lim)

    print(system.time(for(i in 1:n) {which(set1==samp1[i])}))
    print(system.time(for(i in 1:n) {samp1[i] %in% set1}))
    print(system.time(for(i in 1:n) {is.element(samp1[i], set1)}))
    print(system.time(for(i in 1:n) {match(samp1[i], set1)}))
    a <- system.time(set1 <- sort(set1))
    b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)})
    print(a+b)
}

> EfficiencyTest(10^3, 10^5)
   user  system elapsed
   0.29    0.11    0.40
   user  system elapsed
  19.79    0.39   20.21
   user  system elapsed
  19.89    0.53   20.44
   user  system elapsed
  20.04    0.28   20.33
   user  system elapsed
   0.02    0.00    0.03

Here, BinVecCheck is a binary search function that I wrote; it returns TRUE/FALSE. Note that the timing for the final method includes the time it takes to sort the vector. Here is the code for the binary search:

BinVecCheck <- function(tar, vec) {      
    if (tar==vec[1] || tar==vec[length(vec)]) {return(TRUE)}        
    size <- length(vec)
    size2 <- trunc(size/2)
    dist <- (tar - vec[size2])       
    if (dist > 0) {
        lower <- size2 - 1L
        upper <- size
    } else {
        lower <- 1L
        upper <- size2 + 1L
    }        
    while (size2 > 1 && !(dist==0)) {
        size2 <- trunc((upper-lower)/2)
        temp <- lower+size2
        dist <- (tar - vec[temp])
        if (dist > 0) {
            lower <- temp-1L
        } else {
            upper <- temp+1L
        }
    }       
    if (dist==0) {return(TRUE)} else {return(FALSE)}
}
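
For comparison, a conventional binary search can be sketched as follows. This is not the function timed above: BinSearchContains is a hypothetical name, and the sketch assumes vec is sorted in increasing order and contains no NAs. It narrows a closed interval [lower, upper] and always excludes the midpoint it has just tested, which guarantees that the interval shrinks. Tracing BinVecCheck by hand with tar = 9 and vec = 1:10 appears to end the loop with dist = 1 and return FALSE, because lower and upper are moved to temp-1L and temp+1L and the interval stops shrinking, so the original may miss some elements.

```r
# Hypothetical helper, not the BinVecCheck timed above.
# Assumes `vec` is sorted in increasing order and has no NAs.
BinSearchContains <- function(tar, vec) {
    lower <- 1L
    upper <- length(vec)
    while (lower <= upper) {
        # Midpoint of the closed interval [lower, upper]
        mid <- lower + (upper - lower) %/% 2L
        if (vec[mid] == tar) {
            return(TRUE)
        } else if (vec[mid] < tar) {
            lower <- mid + 1L   # target can only be to the right of mid
        } else {
            upper <- mid - 1L   # target can only be to the left of mid
        }
    }
    FALSE
}

sorted <- sort(c(7L, 3L, 10L, 1L, 9L))
found <- vapply(c(9L, 1L, 10L, 4L), BinSearchContains, logical(1), vec = sorted)
```

Here found holds one logical value per queried element; 4 is the only query that is absent from sorted.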

Platform Info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Question

Is there a more efficient way of determining whether an element exists in a vector in R? For example, is there an R equivalent of the Python set type that greatly improves on this approach? Also, why are %in% and the like so inefficient, even compared to which, which gives more information (it not only determines existence but also returns the indices of all matches)?

Joseph Wood
  • Don't use `system.time()` -- use the rbenchmark or microbenchmark packages and their respective functions `benchmark()` and `microbenchmark()`. There are hundreds of usage examples here. – Dirk Eddelbuettel Oct 31 '15 at 15:20
  • Related: http://stackoverflow.com/questions/32934933/faster-in-operator – Rich Scriven Oct 31 '15 at 16:03
  • Agree with @DirkEddelbuettel. `system.time()` is not the way to go – alexwhitworth Oct 31 '15 at 16:14
  • Why is `system.time()` a bad choice here? From the [rbenchmark](https://cran.r-project.org/web/packages/rbenchmark/rbenchmark.pdf) documentation, `benchmark()` is simply a wrapper of `system.time()` – Joseph Wood Oct 31 '15 at 16:17
  • Just look at the answer by @ben-bolker: a single call _compares and times multiple alternatives_. As I said above, just look (or better search) around SO for _countless_ usage examples. – Dirk Eddelbuettel Oct 31 '15 at 16:20
  • Why are you looping over `length(samp1)` for `%in%`, `is.element`, and `match`? See, also, `?findInterval`. – alexis_laz Oct 31 '15 at 18:16
  • I think you are implying that those functions are inherently the same.. I simply included them for thoroughness's sake. As you can see from [this post](http://stackoverflow.com/questions/1169248/r-function-for-testing-if-a-vector-contains-a-given-element?rq=1), these functions are quite popular. Also, I was unaware of `rbenchmark`. Lastly, `findInterval` is a really useful function, however, it can't be applied efficiently in all situations (e.g. if you don't know _a priori_ all of the elements of the first vector). – Joseph Wood Oct 31 '15 at 18:37
  • Sorry, I meant that these functions don't need looping for each element; i.e. `samp1 %in% set1` seems valid. – alexis_laz Oct 31 '15 at 19:05
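
The point made in the last comments can be sketched as follows: `%in%` accepts a whole vector on its left-hand side, so when all query values are available up front, a single call replaces the loop. The sizes and the seed below are arbitrary choices for illustration, and no timings are shown because they depend on platform and R version.

```r
set.seed(101)
Lim <- 1000L
samp1 <- sample(Lim, 50L)
set1 <- sample(2L * Lim, Lim)   # only some values of samp1 will be present

# One element at a time, as in EfficiencyTest
looped <- vapply(samp1, function(x) x %in% set1, logical(1))

# Whole vector in a single call
vectorised <- samp1 %in% set1

same <- identical(looped, vectorised)
```

Both forms return the same logical vector; the looped form is only needed when the query values are not known in advance.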

4 Answers

9

My tests aren't bearing out all of your claims, but that seems (?) to be due to cross-platform differences (which makes the question even more mysterious, and possibly worth taking up on r-devel@r-project.org, although maybe not since the fastmatch solution below dominates anyway ...)

 n <- 10^3; Lim <- 10^5
 set.seed(101)
 samp1 <- sample(Lim,n)
 set1 <- sample(Lim,Lim)
 library("rbenchmark")

 library("fastmatch")
 `%fin%` <- function(x, table) {
     stopifnot(require(fastmatch))
     fmatch(x, table, nomatch = 0L) > 0L
 }
 benchmark(which=sapply(samp1,function(x) which(set1==x)),
           infun=sapply(samp1,function(x) x %in% set1),
           fin= sapply(samp1,function(x) x %fin% set1),
           brc= sapply(samp1,BinVecCheck,vec=sort(set1)),
           replications=20,
    columns = c("test", "replications", "elapsed", "relative"))

##    test replications elapsed relative
## 4   brc           20   0.871    2.329
## 3   fin           20   0.374    1.000
## 2 infun           20   6.480   17.326
## 1 which           20  10.634   28.433

This says that %in% is about 1.6x as fast as which; your BinVecCheck function is about 7x faster than %in%, and the fastmatch solution from here gains another factor of 2. I don't know if a specialized Rcpp implementation could do better or not ... In fact, I get different answers even when running your code:

##    user  system elapsed   (which)
##   0.488   0.096   0.586 
##    user  system elapsed   (%in%) 
##   0.184   0.132   0.315 
##    user  system elapsed  (is.element)
##   0.188   0.124   0.313 
##    user  system elapsed  (match)
##   0.148   0.164   0.312 
##    user  system elapsed  (BinVecCheck)
##   0.048   0.008   0.055 

update: on r-devel Peter Dalgaard explains the platform discrepancy (which is an R version difference, not an OS difference) by pointing to the R NEWS entry:

match(x, table) is faster, sometimes by an order of magnitude, when x is of length one and incomparables is unchanged, thanks to Haverty's PR#16491.

sessionInfo()
## R Under development (unstable) (2015-10-23 r69563)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu precise (12.04.5 LTS)
Ben Bolker
  • 1. _brc =sapply(samp1,BinVecCheck,vec=set1)_ is incorrect. You can't apply a binary search to an unsorted array. This is okay though: _brc =sapply(samp1,BinVecCheck,vec=**sort(set1)**)._ 2. When I copy your code and run it, my results are completely different than yours. "which" runs about 40x faster than "infun." I am running a completely different OS (Platform: x86_64-w64-mingw32/x64 (64-bit)) – Joseph Wood Oct 31 '15 at 16:08
  • OK, will revise. Don't know about the platform differences. – Ben Bolker Oct 31 '15 at 16:09
  • oddly, sorting barely seems to slow things down at all. – Ben Bolker Oct 31 '15 at 16:13
3

%in% is just sugar for match, and is defined as:

"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0

Both match and which are low level (compiled C) functions called by .Internal(). You can actually see the source code by using the pryr package:

install.packages("pryr")
library(pryr)
pryr::show_c_source(.Internal(which(x)))
pryr::show_c_source(.Internal(match(x, table, nomatch, incomparables)))

You would be pointed to this page for which and this page for match. which does not perform any of the casting, checks etc that match performs. This might explain its higher performance in your tests (but I haven't tested your results myself).
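
A small sketch of how the three calls differ in what they return; the vector and values are made up for illustration.

```r
set1 <- c(5L, 3L, 8L, 3L)

first_pos <- match(3L, set1)        # position of the first match only
all_pos   <- which(set1 == 3L)      # positions of every match
present   <- 3L %in% set1           # logical answer

# What %in% computes for a value that is not in the table
absent <- match(4L, set1, nomatch = 0L) > 0L
```

match stops at the first hit, while which(set1 == 3L) compares every element and then collects all the indices.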

kliron
2

After many days researching this topic, I have found that the fastest method of determining existence depends on the number of elements being tested. From the answer given by @ben-bolker, %fin% looks like the clear-cut winner. This seems to be the case when the number of elements being tested (all elements in samp1) is small compared to the size of the vector (set1). Before we go any further, let's look at the binary search algorithm above.

First of all, the very first line in the original algorithm has an extremely low probability of evaluating to TRUE, so why check it every time?

if (tar==vec[1] || tar==vec[size]) {return(TRUE)}

Instead, I put this statement inside the else branch at the very end.

Secondly, determining the size of the vector every time is redundant, especially when I know the length of the test vector (set1) ahead of time. So, I added size as an argument to the algorithm and simply pass it as a variable. Below is the modified binary search code.

ModifiedBinVecCheck <- function(tar, vec, size) {
    size2 <- trunc(size/2)
    dist <- (tar - vec[size2])
    if (dist > 0) {
        lower <- size2 - 1L
        upper <- size
    } else {
        lower <- 1L
        upper <- size2 + 1L
    }
    while (size2 > 1 && !(dist==0)) {
        size2 <- trunc((upper-lower)/2)
        temp <- lower+size2
        dist <- (tar - vec[temp])
        if (dist > 0) {
            lower <- temp-1L
        } else {
            upper <- temp+1L
        }
    }
    if (dist==0) {
        return(TRUE)
    } else {
        if (tar==vec[1] || tar==vec[size]) {return(TRUE)} else {return(FALSE)}
    }
}

As we know, a binary search requires a sorted vector, and sorting costs time. The default method for sort is shell, which works on all data types but is (generally speaking) slower than the quick method (quick can only be used on doubles or integers). Using quick for sorting (since we are dealing with numbers) together with the modified binary search gives a significant performance increase over the old binary search, depending on the case. It should be noted that fmatch improves on match only when the data type is integer, real, or character.
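
The sorting step can be sketched in isolation. The seed and size are arbitrary, and no timings are given since they depend on platform and R version. Both calls return the same ordering, so the choice of method affects speed only.

```r
set.seed(101)
set1 <- sample(10^5)

by_default <- sort(set1)                     # default method
by_quick   <- sort(set1, method = "quick")   # numeric (double or integer) input only

same_order <- all(by_default == by_quick)

# Sort once and keep the result, so every later lookup reuses it
sorted_set <- by_quick
```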

Now, let's look at some test cases with differing sizes of n.

Case 1 (n = 10^3 & Lim = 10^6, so n to Lim ratio is 1:1000):

n <- 10^3; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
            brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
            oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
            replications=10,
            columns = c("test", "replications", "elapsed", "relative"))
    test replications elapsed relative
2    brc           10    0.97    4.217
1    fin           10    0.23    1.000
3 oldbrc           10    1.45    6.304

Case 2 (n = 10^4 & Lim = 10^6, so n to Lim ratio is 1:100):

n <- 10^4; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
            brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
            oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
            replications=10,
            columns = c("test", "replications", "elapsed", "relative"))
    test replications elapsed relative
2    brc           10    2.08    1.000
1    fin           10    2.16    1.038
3 oldbrc           10    2.57    1.236

Case 3 (n = 10^5 & Lim = 10^6, so n to Lim ratio is 1:10):

n <- 10^5; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
            brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
            oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
            replications=10,
            columns = c("test", "replications", "elapsed", "relative"))
    test replications elapsed relative
2    brc           10   13.13    1.000
1    fin           10   21.23    1.617
3 oldbrc           10   13.93    1.061

Case 4 (n = 10^6 & Lim = 10^6, so n to Lim ratio is 1:1):

n <- 10^6; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
            brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
            oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
            replications=10,
            columns = c("test", "replications", "elapsed", "relative"))
   test replications elapsed relative
2    brc           10  124.61    1.000
1    fin           10  214.20    1.719
3 oldbrc           10  127.39    1.022


As you can see, as n gets large relative to Lim, the efficiency of the binary search (both versions) starts to dominate. In Case 1, %fin% was over 4x faster than the modified binary search; in Case 2 there was almost no difference; in Case 3 we really start to see the binary search dominate; and in Case 4, the modified binary search is about 1.7x as fast as %fin%.

Thus, the answer to the question "Which method is faster?" is that it depends. %fin% is faster when the number of elements checked is small relative to the size of the test vector, and ModifiedBinVecCheck is faster when that number is large.

Joseph Wood
  • Sorry if I'm missing something or a benchmarking purpose, but you don't have to `sapply` `%in%` over "samp1". E.g. : `n <- 10^5; Lim <- 10^6; set.seed(101); samp1 <- sample(Lim,n); set1 <- sample(Lim,Lim)`. `system.time({ brc = sapply(samp1, ModifiedBinVecCheck, vec = sort(set1, method = "quick"), size = Lim) })`. `system.time({ use_in = samp1 %in% set1 })`. `findInt_in = function(x, table) { ints = findInterval(x, sort(table)); ints > 0 & ints < length(table) }`. `system.time({ fint = findInt_in(samp1, set1) })`. `identical(brc, use_in)`. `identical(brc, fint)` – alexis_laz Nov 02 '15 at 13:26
  • @alexis_laz, you are absolutely correct about the implementation of `%in%`. With `%in%` one can determine whether the elements of a vector exist in another vector in one call, however, many times, you will not know all of the elements to be tested _a priori_. If I were to implement this (I haven't tested this), it seems like I would be comparing different things. – Joseph Wood Nov 02 '15 at 14:22
  • Oh, I see. By the way, if memory allows you, you could -at first and once- `tset1 = tabulate(set1)` and in the loop see if `!is.na(tset1[samp1[i]]) & tset1[samp1[i]] != 0`. If "set1" is not an integer, you could `tabulate(match(set1, unique(set1)))` – alexis_laz Nov 02 '15 at 15:26
  • if this is really important to you, you should be able to do still better by translating your binary-sort machinery to Rcpp ... – Ben Bolker Nov 02 '15 at 22:07
  • @BenBolker, I would definitely go that route if I could. As I need this for a project I am working on at my workplace, hence using my work computer, I am not at liberty to download `Rtools` (I need Administrator rights). Thanks for all of your help. – Joseph Wood Nov 03 '15 at 14:04
  • you could conceivably build a binary version of the package elsewhere, put it in your own repository (e.g. check out the drat package), and install it onto your work computer from there -- wouldn't need admin rights. But it's up to you ... – Ben Bolker Nov 03 '15 at 14:18
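
The lookup-table idea from the comments can be sketched as follows, assuming set1 holds positive integers no larger than a known bound. InTable is a hypothetical helper name, and the sizes are arbitrary. The table is built once; each later check is a single index operation.

```r
set.seed(101)
Lim <- 1000L
set1 <- sample(Lim, 200L)        # positive integers, as tabulate() requires

# Built once: counts[k] is how often k occurs in set1
counts <- tabulate(set1, nbins = Lim)

# Hypothetical helper: membership test for one integer value
InTable <- function(x, counts) {
    x >= 1L && x <= length(counts) && counts[x] > 0L
}

hits <- vapply(1:Lim, InTable, logical(1), counts = counts)
```

The memory cost is one integer per possible value, so this only pays off when the range of values is not much larger than the set itself.
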
1

any( x == "foo" ) should be plenty fast if you can be sure that x is free of NAs. If you may have NAs, R 3.3 has a speedup for "%in%" that will help.
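
A short illustration of why the NA caveat matters; the vectors are made up.

```r
x <- c("bar", NA, "baz")

r1 <- any(x == "foo")                 # NA: the comparison with NA is undecided
r2 <- any(x == "foo", na.rm = TRUE)   # FALSE once the NA is dropped
r3 <- "foo" %in% x                    # FALSE: %in% never returns NA

y <- c("bar", NA, "foo")
r4 <- any(y == "foo")                 # TRUE: one TRUE is enough, despite the NA
```

So any(x == value) is only a drop-in replacement for %in% when x has no NAs or when na.rm = TRUE is supplied.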

For binary search, see findInterval before rolling your own. This doesn't sound like a job for binary search unless x is constant and sorted.
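
A minimal sketch of a membership test built on findInterval, assuming the table is already sorted in increasing order and has no NAs. InSorted is a hypothetical helper name. findInterval returns, for each x, the index of the largest element not exceeding x (0 if there is none), so x is present exactly when that element equals x.

```r
# Hypothetical helper; assumes `sorted_vec` is sorted increasingly, without NAs
InSorted <- function(x, sorted_vec) {
    idx <- findInterval(x, sorted_vec)   # 0 when x is below every element
    # pmax() keeps the index valid where idx is 0; `idx > 0L` masks those cases
    idx > 0L & sorted_vec[pmax(idx, 1L)] == x
}

sorted_vec <- sort(c(12L, 4L, 9L, 30L))
res <- InSorted(c(4L, 5L, 30L, 1L, 31L), sorted_vec)
```

Unlike a hand-written loop, this handles a whole vector of queries in one call, with the search itself done in compiled code.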

Ethan Bierlein