
If we have a named array, say a 2-by-3 matrix

amatrix <- cbind(a=1:2, b=3:4, c=5:6)
##      a b c
## [1,] 1 3 5
## [2,] 2 4 6

we can subset a column, say #2, by name or by index:

amatrix[, 'b']
## [1] 3 4
amatrix[, 2]
## [1] 3 4

Which of these two subsetting methods is faster, and by how much? I suspect that name subsetting should be slower, owing to string-matching, and wonder if I should take this into account when subsetting hundreds of thousands of arrays.

A related question and its answer report, and explain, why subsetting lists with [[ can be faster than with $, or vice versa, depending on the context. But I have not found anything comparable for the present question about [.

zx8754
pglpm
  • Questions of optimization, performance, and profiling are typically highly context dependent. For example, one method may be faster on smaller data sets but exponentially slow down on larger sets. That's why various domains tend to have standard test sets. For a specific situation, I would code up the various approaches with representative data and empirically test with the `microbenchmark` package – Marcus May 12 '23 at 17:00
  • @Marcus Absolutely true. I'll eventually have to change my code to try and time both approaches, but it would take more time than I have right now. I was hoping that some answer could give an explanation of the underlying workings, besides benchmarking (like the answer linked in my post), to help me understand how that'd work in my case. – pglpm May 12 '23 at 17:13

1 Answer

We can do an experiment:

# long named vector
v <- setNames(
  1:1e6,
  paste0('V', 1:1e6)
)

b <- bench::mark(
  index_by_position = v[1000],
  index_by_name = v['V1000'],
  min_time = 10
)
b
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 index_by_…    412ns    501ns  1594553.        0B     0    10000     0     6.27ms <int>  <Rprofmem>
2 index_by_…   1.12ms   1.64ms      486.    7.63MB     5.71  2977    35      6.12s <int>  <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>

plot(b)

[Figure: output of plot(b), showing the distribution of timings for the two expressions]

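It appears that indexing by name is substantially slower: the median time is 1.64ms by name versus 501ns by position, roughly 3,000 times slower for this vector of one million names. The lookup by name also allocates 7.63MB per call, whereas the lookup by position allocates nothing.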
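If many arrays share the same names, one way to avoid paying for the name lookup on every subset is to resolve the name to a position once and then index by position. The following is only a sketch under that assumption; `cols`, `mats` and `j` are made-up names for illustration, and whether this pays off in a real workload would still need to be timed.

```r
# Assumption: every matrix in the list has the same column names.
set.seed(1)
cols <- paste0("V", 1:1000)
mats <- replicate(
  100,
  matrix(rnorm(2 * length(cols)), nrow = 2, dimnames = list(NULL, cols)),
  simplify = FALSE
)

# Resolve the column name to its position once ...
j <- match("V500", cols)

# ... then subset every matrix by position.
by_position <- lapply(mats, function(m) m[, j])

# For comparison: the same subset with a name lookup in every call.
by_name <- lapply(mats, function(m) m[, "V500"])

# Both approaches return the same values.
stopifnot(identical(by_position, by_name))
```

If the arrays do not share names, the position would have to be looked up per array, which brings the matching cost back.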

Playing around a bit, the performance difference between the two methods:

  • appears to be very similar for a 1-by-N matrix,
  • becomes larger as the vector grows.
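These two observations can be checked with a rough base-R timing loop, without extra packages. This is a sketch, not the code used for the numbers above: `time_lookup` is a hypothetical helper, the sizes and repetition count are arbitrary, and `system.time` is much coarser than `bench::mark`, so the positional timings may well print as zero.

```r
# Time column lookup by position and by name on a 1-by-n matrix.
# time_lookup is a hypothetical helper, not part of any package.
time_lookup <- function(n, reps = 200) {
  m <- matrix(
    seq_len(n),
    nrow = 1,
    dimnames = list(NULL, paste0("V", seq_len(n)))
  )
  pos <- n %/% 2          # position of the column to extract
  nm <- paste0("V", pos)  # name of that same column

  # Sanity check: both forms return the same element.
  stopifnot(identical(m[, nm], m[, pos]))

  c(
    n = n,
    by_position = system.time(for (i in seq_len(reps)) m[, pos])[["elapsed"]],
    by_name = system.time(for (i in seq_len(reps)) m[, nm])[["elapsed"]]
  )
}

# One row per matrix width; elapsed times are in seconds for `reps` lookups.
timings <- t(sapply(c(1e3, 1e4, 1e5), time_lookup))
print(timings)
```

For numbers worth acting on, the same expressions can be passed to `bench::mark` or `microbenchmark` with data representative of the real arrays.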
Axeman
  • For reproduction, this requires both `ggplot2` and `ggbeeswarm` to show that plot. They don't have to be _loaded_, just _installed_. – r2evans May 12 '23 at 17:00