Vector subsetting performance: name versus index

Question

If I have a vector v with names:

      John     Murray       Lisa       Mike        Joe        Ann 
 0.0832090  0.0475580 -0.2797860  0.1086225  0.0104590 -0.0028250 

What is the time complexity of v['Joe'] versus v[4]? I guess the former takes O(log n), since it presumably involves a binary search, but I'm not sure whether the latter is O(1).

Also, does the result generalize to the case when v is a list/data frame rather than an atomic vector?

Simon O'Hanlon · Accepted Answer · 2013-12-02T10:22:00.347


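Name lookups appear to be approximately O(n) in the length of the vector, i.e. a scan over the names. Lookup by index appears to be O(1), as you suspected. In the timings below, looking up a name near the start of the large vector costs about the same as one near the end, and both are tens of thousands of times slower than lookup by index.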

#  Unique names for longish vector
nms <- apply( expand.grid( letters , letters , letters , letters ) , 1 , paste , collapse = "" )
length(nms)
#[1] 456976
length(unique(nms))
#[1] 456976

#  Start of names
head(nms)
#[1] "aaaa" "baaa" "caaa" "daaa" "eaaa" "faaa"

#  End of names
tail(nms)
#[1] "uzzz" "vzzz" "wzzz" "xzzz" "yzzz" "zzzz"

#  Large named vector
x <- setNames( runif( 456976 ) , nms )

#  Small named vector
y <- setNames( runif(26) , letters )

#  Timing information
library( microbenchmark )  # fails loudly if the package is not installed
bm <- microbenchmark( x['daaa'] , x[4] , x['vzzz'] , x[456972] , y['d'] , y[4] )
print( bm , order = 'median' , unit = 'relative' , digits = 3 )
#Unit: relative
#      expr min       lq   median       uq      max neval
# x[456972] NaN 1.00e+00     1.00     1.00    1.000   100
#      x[4] Inf 1.00e+00     1.33     1.07    0.957   100
#      y[4] NaN 5.01e-01     1.33     1.14    0.191   100
#    y["d"] Inf 1.00e+00     2.00     1.25    0.265   100
# x["vzzz"] Inf 6.57e+04 44412.24  9969.64 3439.154   100
# x["daaa"] Inf 6.59e+04 44582.73 10049.63 1207.337   100
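The benchmark above compares only two vector lengths, and it does not cover the list case raised in the question. A minimal sketch of a scaling check follows, using base R only. The helper `time_call`, the sizes and the name format are assumptions made for illustration, not part of the original answer. Absolute timings depend on the machine and R version, so the output should be read as a trend across `n` rather than as a proof of any complexity class.

```r
# Hypothetical helper (not from the original answer): median seconds per call
# of f(), estimated from `reps` batches of `inner` calls each.
time_call <- function(f, reps = 5, inner = 20) {
  batch <- replicate(reps, system.time(for (i in seq_len(inner)) f())[["elapsed"]])
  median(batch) / inner
}

sizes <- c(1e4, 1e5, 1e6)

res <- t(sapply(sizes, function(n) {
  nms <- sprintf("n%07d", seq_len(n))  # unique names
  v   <- setNames(runif(n), nms)       # named atomic vector
  l   <- as.list(v)                    # named list with the same names
  key <- nms[n]                        # name of the last element
  c(n          = n,
    vec_name   = time_call(function() v[key]),
    vec_index  = time_call(function() v[n]),
    list_name  = time_call(function() l[[key]]),
    list_index = time_call(function() l[[n]]))
}))

# One row per vector length; columns are seconds per lookup
print(res)
```

If lookup by name is roughly linear, the `vec_name` column should grow by about a factor of ten per row, while `vec_index` should stay close to the timer resolution. The same reading applies to the two list columns.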

edited Dec 02 '13 at 10:22

answered Dec 02 '13 at 10:16



Great! Thanks for the answer. You also showed me a way to experiment with this kind of question, which will be really useful in future! – chanp Dec 02 '13 at 10:36
It's a bit more complicated than that. See http://stackoverflow.com/questions/3470447 for details. Also note that subsetting n times is much slower than doing a single subset with n values. – hadley Dec 02 '13 at 13:20
@hadley this code only does a single subset for each test. The value you see is the average time taken across 100 runs? And the accepted answer in the link draws the same conclusion. O(1) and O(n). I don't see your point. And I would've thought that `x[n]` simply adds an offset of n to the pointer address of the first address of the vector, hence the O(1). – Simon O'Hanlon Dec 02 '13 at 13:23
@SimonO101 `x[letters]` is not the same as `for (letter in letters) x[letter]` - access might be in O(n) in one, but O(1) in the other. Benchmarking extracting a single value isn't terribly realistic, since you normally subset by many values - and you can't extrapolate the performance from subsetting by a single value. – hadley Dec 02 '13 at 21:31
@hadley That sounds interesting. But does this mean that each x[letter] access inside a for loop is faster than the same access outside a loop? Can you please elaborate on that? – chanp Dec 03 '13 at 11:09
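The distinction hadley draws can be tried out with a short sketch: one subset with many names versus many subsets with one name each. The vector size, the name format and the number of keys are assumptions chosen for illustration. The timings are rough wall-clock figures that will vary by machine and R version.

```r
set.seed(1)
n <- 1e5
x <- setNames(runif(n), sprintf("n%06d", seq_len(n)))
keys <- sample(names(x), 200)  # hypothetical set of names to look up

# One vectorised subset: all names are resolved in a single call
vectorised <- x[keys]

# The same values fetched one name at a time
looped <- vapply(keys, function(k) x[[k]], numeric(1))

# Both approaches return the same values
stopifnot(identical(unname(vectorised), unname(looped)))

# Rough wall-clock comparison over 5 repetitions of each approach
t_vec  <- system.time(for (i in 1:5) x[keys])[["elapsed"]]
t_loop <- system.time(for (i in 1:5) for (k in keys) x[k])[["elapsed"]]
print(c(vectorised = t_vec, looped = t_loop))
```

The point of the comparison is that the per-call cost measured for a single `x[k]` need not multiply up to the cost of `x[keys]`, which is why timing a single lookup may not predict the cost of a realistic subset.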
