
I want to extract elements from a list based on indices stored in a separate vector.

This is my attempt at it:

list_positions <- c(2, 3, 4)
my_list <- list(c(1, 3, 4), c(2, 3, 4, 5, 6), c(1, 2, 3, 4, 6))

my_fun <- function(x, y) {
  x[y]
}

mapply(my_fun, x = my_list, y = list_positions)

Maybe somebody can suggest a faster solution. My list has around 14 million elements. I tried parallel solutions where I used clusterMap instead of mapply, but I would still like better performance.
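For reference, a minimal sketch of that clusterMap attempt (the cluster type and worker count below are just placeholders, not the exact setup):

library(parallel)

cl <- makeCluster(4)   # placeholder worker count
res <- clusterMap(cl, my_fun, my_list, list_positions)
stopCluster(cl)
unlist(res)
#[1] 3 4 4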

Vitalijs
  • Generally, evaluating a "closure" is more costly than a primitive, and, obviously, that is relevant only in repeated evaluations, so replacing `my_fun` with `"["` should gain _a bit_ more speed. – alexis_laz Nov 27 '16 at 16:38

2 Answers


We can unlist the list, create an index based on the lengths of 'my_list', and extract from the resulting vector:

v1 <- unlist(my_list)
p1 <- list_positions
v1[cumsum(lengths(my_list)) - (lengths(my_list) - p1)]
#[1] 3 4 4
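A quick walk-through of the index arithmetic with the example data above, breaking the expression into named steps (lens, ends, idx are just illustrative names): the last value of the i-th element sits at cumsum(lengths(my_list))[i] in the flattened vector, and stepping back lengths(my_list)[i] - p1[i] entries lands on the requested value.

lens <- lengths(my_list)     # 3 5 5  (length of each list element)
ends <- cumsum(lens)         # 3 8 13 (where each element ends in v1)
idx  <- ends - (lens - p1)   # 2 6 12 (absolute positions of the wanted values)
v1[idx]
#[1] 3 4 4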

Benchmarks

set.seed(24)
lst <- lapply(1:1e6, function(i) sample(1:10, sample(2:5, 1), replace = FALSE))
p2 <- sapply(lst, function(x) sample(length(x), 1))
system.time({
  r1 <- mapply(`[`, lst, p2)
})
#   user  system elapsed 
#   1.84    0.02    1.86 

system.time(r4 <- mapply(my_fun, lst, p2))
#   user  system elapsed 
#   1.88    0.01    1.89 
system.time({ r4 <- mapply(my_fun, lst, p2) }) # placing inside the {}
#   user  system elapsed 
#   2.31    0.00    2.31 


system.time({ ## cccmir's my_func1 (the for-loop function from the other answer)
  r3 <- mapply(my_func1, lst, p2)
})
#   user  system elapsed 
#  12.10    0.03   12.13 


system.time({
v2 <- unlist(lst)
r2 <- v2[cumsum(lengths(lst))- (lengths(lst)-p2)]
})
#  user  system elapsed 
#   0.14    0.00    0.14 
identical(r1, r2)
#[1] TRUE
akrun
  • I guess `system.time( mapply(my_fun, lst, p2) )` could be interesting to show too, since the OP is using a "closure" instead of `"["` -- on my machine using `my_fun` is ~1.8 times slower. Btw, cccmir's does not need `mapply`; `my_func1(lst, p2)` should do it. – alexis_laz Nov 27 '16 at 16:45
  • @alexis_laz I updated it; it showed only a minimal increase on my system. – akrun Nov 27 '16 at 16:48
  • @alexis_laz Let me put in the `{}` and check if it increases – akrun Nov 27 '16 at 16:48
  • Interesting. Do multiple runs with `system.time` indeed show similar performance? For the record, I consistently get 2.3s for "r4" vs 1.3s for "r1" – alexis_laz Nov 27 '16 at 16:51
  • @alexis_laz I tried a couple of times; it fluctuates between 2 and 2.3, 2.13, etc. Maybe `microbenchmark` is better – akrun Nov 27 '16 at 16:53

You should use a for loop in this case, for example:

library(microbenchmark)

list_positions <- c(2, 3, 4)
my_list <- list(c(1, 3, 4), c(2, 3, 4, 5, 6), c(1, 2, 3, 4, 6))

my_fun <- function(x, y) {
  x[y]
}

mapply(my_fun, x = my_list, y = list_positions)

my_func1 <- function(aList, positions) {
  res <- numeric(length(aList))
  for (i in seq_along(aList)) {
    res[i] <- aList[[i]][positions[i]]
  }
  return(res)
}


my_func2 <- function(aList, positions) {
  # unlist once and index into the flattened vector (akrun's approach)
  v1 <- unlist(aList)
  p1 <- positions
  v1[cumsum(lengths(aList)) - (lengths(aList) - p1)]
}

microbenchmark(
  mapply(my_fun, x = my_list, y = list_positions),
  my_func1(my_list, list_positions),
  my_func2(my_list, list_positions),
  times = 1000
)

#Unit: microseconds
#                                           expr    min     lq      mean median     uq     max neval
#mapply(my_fun, x = my_list, y = list_positions) 12.764 13.858 17.453172 14.588 16.775 119.613  1000
#               my_func1(my_list, list_positions)  5.106  5.835  7.328412  6.200  6.929  38.292  1000
#               my_func2(my_list, list_positions)  2.553  3.282  4.337367  3.283  3.648  52.514  1000

@akrun's solution is the fastest.

cccmir
  • Can you explain why your solution is faster than mine? I always thought that "apply"-family solutions are faster than loops – Vitalijs Nov 27 '16 at 16:25
  • You can see it is faster from the microbenchmark results: your mean is 17.3226, mine is 7.63, over 1000 runs – cccmir Nov 27 '16 at 16:26
  • @akrun's result benchmarks better than mine, so you should use it – cccmir Nov 27 '16 at 16:29
  • On a 1e6 dataset, your solution is slower for me. Did I miss something? – akrun Nov 27 '16 at 16:29
  • I see the result; I just wanted to understand "why" mapply is slower than the loop! – Vitalijs Nov 27 '16 at 16:37
  • I believe `my_func1` can be significantly faster, mostly by (i) wrapping it with `compiler::cmpfun` and, to a lesser extent, by (ii) allocating `res` as `vector(typeof(aList[[1]]), length(aList))` to avoid coercions in case of a type mismatch (see the sketch after these comments). – alexis_laz Nov 27 '16 at 16:41
  • Sorry, I didn't get it; I'm not sure why – cccmir Nov 27 '16 at 16:41
  • @VitalijsJascisens: `mapply` should be comparable with a "for loop" for larger datasets. I guess that on this small dataset `mapply` gets beaten by its overhead. – alexis_laz Nov 27 '16 at 16:47
  • @VitalijsJascisens you should use microbenchmark to test performance between methods on a large dataset – cccmir Nov 27 '16 at 16:49
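A minimal sketch of the byte-compilation idea from alexis_laz's comment above (not benchmarked here; `my_func1_cmp` is just an illustrative name):

library(compiler)

## byte-compile the loop-based function
my_func1_cmp <- cmpfun(my_func1)

## optionally, pre-allocate res with the list's element type instead of numeric(),
## to avoid coercion when the list does not hold doubles:
## res <- vector(typeof(aList[[1]]), length(aList))

my_func1_cmp(my_list, list_positions)
#[1] 3 4 4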