I have several algorithms that depend on the efficiency of determining whether an element exists in a vector or not. It seems to me that %in%
(which is equivalent to is.element()
) should be the most efficient as it simply returns a Boolean value. After testing several methods, to my surprise, those methods are by far the most inefficient. Below is my analysis (the results get worse as the size of the vectors increase):
EfficiencyTest <- function(n, Lim) {
samp1 <- sample(Lim, n)
set1 <- sample(Lim, Lim)
print(system.time(for(i in 1:n) {which(set1==samp1[i])}))
print(system.time(for(i in 1:n) {samp1[i] %in% set1}))
print(system.time(for(i in 1:n) {is.element(samp1[i], set1)}))
print(system.time(for(i in 1:n) {match(samp1[i], set1)}))
a <- system.time(set1 <- sort(set1))
b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)})
print(a+b)
}
> EfficiencyTest(10^3, 10^5)
user system elapsed
0.29 0.11 0.40
user system elapsed
19.79 0.39 20.21
user system elapsed
19.89 0.53 20.44
user system elapsed
20.04 0.28 20.33
user system elapsed
0.02 0.00 0.03
Where BinVecCheck
is a binary search algorithm that I wrote that returns TRUE
/FALSE
. Note that I include the time it takes to sort the vector with the final method. Here is the code for the binary search:
BinVecCheck <- function(tar, vec) {
if (tar==vec[1] || tar==vec[length(vec)]) {return(TRUE)}
size <- length(vec)
size2 <- trunc(size/2)
dist <- (tar - vec[size2])
if (dist > 0) {
lower <- size2 - 1L
upper <- size
} else {
lower <- 1L
upper <- size2 + 1L
}
while (size2 > 1 && !(dist==0)) {
size2 <- trunc((upper-lower)/2)
temp <- lower+size2
dist <- (tar - vec[temp])
if (dist > 0) {
lower <- temp-1L
} else {
upper <- temp+1L
}
}
if (dist==0) {return(TRUE)} else {return(FALSE)}
}
Platform Info:
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Question
Is there a more efficient way of determining whether an element exists in a vector in R? For example, is there an equivalent R function to the Python set function, that greatly improves on this approach? Also, why is %in%
, and the like, so inefficient even when compared to the which
function which gives more information (not only does it determine existence, but it also gives the indices of all true accounts)?