3

I have a vector of values (x).

I would like to determine the length of its overlap with each of the sets in a list (y), but without running a loop or lapply. Is that possible? I am mainly interested in speeding up the execution.

Thank you very much! Below is an example with an implementation using a loop:

x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)
overlaps <- rep(0, length(y))
for (i in seq(length(y))) {
  # overlaps[i] <- length(intersect(x, y[[i]]))  # it is slower than %in% 
  overlaps[i] <- sum(x %in% y[[i]])
}
overlaps

Below is a comparison of some of the methods suggested in the answers. As you can see, the loop is still the fastest, but I'd love to find something faster:

# Function with the loop:
myloop <- function(x, y) {
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) overlaps[i] <- sum(x %in% y[[i]])
  overlaps
}

# Function with sapply:
mysapply <- function(x, y) sapply(y, function(e) sum(e %in% x))

# Function with map_dbl:
library(purrr)
mymap <- function(x, y) {
  map_dbl(y, ~sum(. %in% x))
}

library(microbenchmark)
microbenchmark(myloop(x, y), mysapply(x, y), mymap(x, y), times = 30000)

# Unit: microseconds
#           expr  min   lq     mean median   uq      max neval
#   myloop(x, y) 17.2 19.4 26.64801   21.2 22.6   9348.6 30000
# mysapply(x, y) 27.1 29.5 39.19692   31.0 32.9  20176.2 30000
#    mymap(x, y) 59.8 64.1 88.40618   66.0 70.5 114776.7 30000
user3245256

3 Answers

5

Use sapply for code compactness.

Even if sapply doesn't bring much of a performance benefit compared to a for loop, at least the code is far more compact. This is the sapply equivalent of your code:

x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)    
res <- sapply(y, function(e) length(intersect(e, x)))

> res
[1] 5 4 3 2 1 0

Performance gains

As correctly stated by @StupidWolf, it's not sapply that is slowing down the execution, but rather length and intersect. Here is my test with 100,000 executions:

B <- 100000
system.time(replicate(B, sapply(y, function(e) length(intersect(e, x)))))
user  system elapsed 
9.79    0.01    9.79

system.time(replicate(B, sapply(y, function(e) sum(e %in% x))))
user  system elapsed 
2       0       2

# Using microbenchmark for more precise results:
library(microbenchmark)
microbenchmark(expr1 = sapply(y, function(e) length(intersect(e, x))), times = B)
expr  min   lq     mean median   uq    max neval
expr1 81.4 84.9 91.87689   86.5 88.2 7368.7 1e+05

microbenchmark(expr2 = sapply(y, function(e) sum(e %in% x)), times = B)
expr  min   lq     mean median uq    max neval
expr2 15.4 16.1 17.68144   16.4 17 7567.9 1e+05

As we can see, the second approach is by far the performance winner.

Hope this helps.

Louis
  • I upvoted for code brevity but can't accept it as a response - because I microbenchmarked it against the 'for' loop and it's slower (mean execution time is 18.7% slower than the for loop, median time is 17% slower) – user3245256 Jan 10 '20 at 18:05
  • @user3245256 I've edited my answer showing the performance comparisons. Hope this helps. – Louis Jan 10 '20 at 18:55
  • Appreciate your looking into it and finding that %in% is faster than intersect. I compared 3 approaches again - this time using e %in% x in sapply AND in the loop. Btw, when I use microbenchmark, I put all expressions I am comparing into one run: microbenchmark(f1(x, y), f2(x, y), f3(x, y), times = 30000). I found map_dbl to be the slowest - more than 3 times the speed of the for loop (with % in it). And sapply (also with % in it) is still slower than the for loop. For loop: mean = 30.5, median = 21.5 and sapply: mean = 42.7, median = 31.5. Unsure why everyone disparages for loops in R. – user3245256 Jan 10 '20 at 21:56
  • I show the microbenchmark comparison in my original post. – user3245256 Jan 10 '20 at 22:04
  • @user3245256 Benchmarking on such small data is of limited value. Scale up the dimensions of the objects and the results will be different. – Ritchie Sacramento Jan 10 '20 at 22:28
  • @H 1: 30000 times in microbenchmarking is not really 'small data'. But to alleviate your concern, I just did microbenchmarking where x has 8 values and y has 1500 elements each with a vector of 8 values. The results are pretty much the same: for mean - myloop 1.68, mysapply 2.3, mymap 2.67. And for median: 1.51, 2.11, and 2.44, respectively. – user3245256 Jan 10 '20 at 22:50
  • @user3245256 - The number of repetitions in the benchmarking is wholly irrelevant to the concept of data size. 8 values by 1500 is still small. Try `x <- c(1:1000); y <- replicate(1000, sample(100000, 10000), simplify = FALSE); microbenchmark::microbenchmark(myloop(x, y), mysapply(x, y), mymap(x, y), times = 100)`. – Ritchie Sacramento Jan 10 '20 at 23:10
  • @H1 Well, in my particular task I need exactly the comparison of one set of 8 values to 3,000-4,000 sets of 8 values. But I did try your example and to be fair, the mean for myloop is 310 while for both sapply and map_dbl it's 190. What is your explanation for why for is slower? I understand each individual comparison is taking longer. But why is for slower than sapply that loops through as many comparisons? – user3245256 Jan 10 '20 at 23:27
  • I also benchmarked my actual task (a comparison of one vector of 8 elements) with 4000 vectors of 8 elements. And for loop is again the fastest, followed by sapply and then by map_dbl – user3245256 Jan 10 '20 at 23:34
2

You can use map from purrr; it goes through every element of the list y and applies a function to it. Below I use map_dbl, which returns a numeric vector:

library(purrr)
map_dbl(y, ~sum(. %in% x))
[1] 5 4 3 2 1 0

To see the time:

f1 <- function() {
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  map_dbl(y, ~sum(. %in% x))
}

f2 <- function() {
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) {
    overlaps[i] <- length(intersect(x, y[[i]]))
  }
  overlaps
}

f3 <- function() {
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  sapply(y, function(i) sum(i %in% x))
}

Let's put it to test:

system.time(replicate(10000,f1()))
   user  system elapsed 
   1.27    0.02    1.35 

system.time(replicate(10000,f2()))
   user  system elapsed 
   1.72    0.00    1.72 

 system.time(replicate(10000,f3()))
   user  system elapsed 
   0.97    0.00    0.97 

So if you want speed, use something like sapply + %in%; if you want something more readable, use purrr.

StupidWolf
  • I upvoted as it's very elegant, but can't accept it as a response - because I microbenchmarked it against the 'for' loop and it's a tad slower (mean execution time is 11% slower than the for loop, median time is 3.9% slower) – user3245256 Jan 10 '20 at 18:04
  • This is quite interesting, you never mentioned runtime explicitly in the question. So, the reason your code runs slow comes from the intersect and length. Use the sum( %in% ) – StupidWolf Jan 10 '20 at 18:16
  • And you have to count the time taken to create your empty vector overlap :) – StupidWolf Jan 10 '20 at 18:22
  • You can't use system.time for speed testing - it's too imprecise. I tested using microbenchmark, not system.time. I built 3 functions - same as you, but each function took x and y as parameters so that the function itself contains only the calculations. I ran it with times = 20000. The result was same as before: for loop (fastest): mean speed 91.8 microsec, median 67.6 microsec; map_dbl: mean 103.6, median 70.2, and lapply: mean 109.3 and median 78.8. – user3245256 Jan 10 '20 at 18:36
  • I did mention that I don't want loops nor apply (hidden loop), but yes, my rationale was, of course, speed. map_dbl looked like a cool solution, but it's not faster than the loop. – user3245256 Jan 10 '20 at 18:38
1

Here is an option using data.table, which should be fast if you have a long list of vectors in y.

library(data.table)
# long format: one row per value in y, with the index of its list element as ID
DT <- data.table(ID = rep(seq_along(y), lengths(y)), Y = unlist(y))
# join on the values of x, then count the matching rows per ID
DT[.(Y = x), on = .(Y)][, .N, ID]

In addition, if you need to run this for multiple x, I would suggest combining all of the x into one data.table before running the code (a rough sketch of this idea is shown after the output below).

output:

   ID N
1:  1 5
2:  2 4
3:  3 3
4:  4 2
5:  5 1
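
Below is a minimal sketch of that multiple-x idea (an illustration, not code from the answer). It assumes the query vectors are collected in a hypothetical list xs, stacks them into a second data.table, and counts the matches for every combination of query vector and y element in a single join:

library(data.table)

xs  <- list(1:5, 3:7)    # hypothetical query vectors
XDT <- data.table(XID = rep(seq_along(xs), lengths(xs)), Y = unlist(xs))
DT  <- data.table(ID  = rep(seq_along(y),  lengths(y)),  Y = unlist(y))

# one join for all query vectors, then count matching rows per (XID, ID) pair;
# as in the single-x output above, y elements with no overlap simply don't appear
DT[XDT, on = .(Y), nomatch = 0L, allow.cartesian = TRUE][, .N, by = .(XID, ID)]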
chinsoon12