
I've been working on a few projects that require a lot of list subsetting, and while profiling my code I realised that the `object[["nameHere"]]` approach to subsetting lists is usually faster than the `object$nameHere` approach.

As an example, if we create a list with named components:

a.long.list <- as.list(1:1000)
names(a.long.list) <- paste0("something", 1:1000)

Why is this:

system.time(
    for (i in 1:10000) {
        a.long.list[["something997"]]
    }
)


user  system elapsed 
0.15    0.00    0.16 

faster than this:

system.time(
    for (i in 1:10000) {
        a.long.list$something997
    }
)

user  system elapsed 
0.23    0.00    0.23 

My question is simply this: is this behaviour universally true, so that I should avoid `$` subsetting wherever possible, or does the most efficient choice depend on other factors?
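For what it's worth, a more stable comparison than a hand-rolled `system.time` loop can be made with the `microbenchmark` package (an assumption on my part that it is available; it is a CRAN package, not part of base R):

# Sketch: benchmark the two lookup styles directly.
# Requires install.packages("microbenchmark").
library(microbenchmark)

microbenchmark(
    bracket = a.long.list[["something997"]],
    dollar  = a.long.list$something997,
    times   = 10000
)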

Jon M
  • +1. I suspect it's related to partial matching with the `$` sign. Suppose you have `my_list <- list("a" = 1, "ace" = 2)`. If you try `my_list$ac` it gets `ace`, but if you try `my_list[["ac"]]`, it finds nothing. – Frank May 18 '13 at 23:48 (see the sketch after these comments)
  • Not answering your question, but if performance were an issue, then you'd rather write a vectorized look-up `query <- sample(names(a.long.list), 1000); a.long.list[query]` to play well with your other vectorized code. – Martin Morgan May 19 '13 at 00:23
  • not ruling out the partial matching theory, but what I hope a complete answer will include is why adding `exact = FALSE` to `[[` in the OP's example does not degrade the performance. – flodel May 19 '13 at 11:48
  • If we change the number of list items to 6 and search for the last one, then `$` seems faster: `n <- 6; short <- as.list(1:n); names(short) <- paste0("something", 1:n); system.time(for (i in 1:10000) short[["something6"]]); system.time(for (i in 1:10000) short$something6)` – G. Grothendieck May 19 '13 at 12:34
  • @G.Grothendieck At least on my system, the `[[` approach is still faster than `$` for that list. I had to bump the reps up to 1000000 to get a difference between the two: elapsed 0.46 versus elapsed 0.56. – Jon M May 19 '13 at 12:47
  • Seems worth mentioning that `$` and `[[` are implemented by two entirely different C functions (both in `src/main/subset.c`). For `$`, the relevant function is [`do_subset3`](https://github.com/wch/r-source/blob/trunk/src/main/subset.c#L1057) which in turn calls [`R_subset3_dflt`](https://github.com/wch/r-source/blob/trunk/src/main/subset.c#L1106). `[[` uses another function, [`do_subset2`](https://github.com/wch/r-source/blob/trunk/src/main/subset.c#L840), which in turn calls [`do_subset2_dflt`](https://github.com/wch/r-source/blob/trunk/src/main/subset.c#L863). – Josh O'Brien May 19 '13 at 17:23
  • The comment preceding `do_subset2` notes simply: "The [[ subset operator. It needs to be fast." – Josh O'Brien May 19 '13 at 17:25
  • Also probably worth mentioning one of the newest changes in R 3.0.0: "Partial matching when using the $ operator on data frames now throws a warning and may become defunct in the future. If partial matching is intended, replace foo$bar by foo[["bar", exact = FALSE]]." – zap2008 May 21 '13 at 02:31
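To make the partial-matching behaviour from Frank's and flodel's comments concrete, here is a minimal sketch (using a small hypothetical list rather than `a.long.list`):

my_list <- list(a = 1, ace = 2)

my_list$ac                       # partial match: returns 2 (matches "ace")
try(my_list[["ac"]])             # exact match only: subscript out of bounds
my_list[["ac", exact = FALSE]]   # opt in to partial matching: returns 2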

1 Answer


The `[[` operator first goes through all elements looking for an exact match, and only then attempts a partial match. The `$` operator tries both exact and partial matching on each element in turn. If you execute:

system.time(
    for (i in 1:10000) {
        a.long.list[["something9973", exact = FALSE]]
    }
)

i.e., you perform a partial-match lookup for which there is no exact match, you will find that `$` is in fact ever so slightly faster.
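For comparison, here is the `$` version of the same failing lookup, in the same loop style (a sketch; absolute timings will vary by machine):

system.time(
    for (i in 1:10000) {
        a.long.list$something9973
    }
)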

Bojan Nikolic
  • I think this answers flodel's clarifying question about why adding `exact = FALSE` doesn't degrade performance. Anyway, I'm now convinced that in programming contexts where speed matters it is better to use `[[` unless there is a high probability of needing partial matching (which more often creates bugs in my programs than solves them). – Jon M May 29 '13 at 21:52
  • BTW, if looking for >100x performance for a 10000-element list, then convert the list with `as.environment(a.long.list)` and perform the lookup on that. Environments are implemented as hash maps, which have near-constant lookup time; linear list lookup gets proportionally slower with size (how far down the list the element is). – Soren Havelund Welling Sep 08 '22 at 10:58 (see the sketch below)