tl;dr
Many feature requests are rejected by R-core because of the maintenance burden, but hashtab (R >= 4.2.0) was not. ?hashtab claims to efficiently associate keys with values. Many other implementations (hash, r2r, hashmap, ...) exist, as do environments and user-friendly extensions to them (rlang, RC, R6, ...). Other than object obfuscation and arbitrary keys, I have not found an obvious use case where hashtab is more efficient than the alternatives.
Question
Does hashtab have a distinct feature other than arbitrary keys, or a tangible benefit in speed, memory or syntax for some use case?
I have tried to look at the performance, features and internals of environments and a hashtab.
set.seed(1)
make_hash <- function(n, keys, values) {
h <- hashtab("identical", n)
for(i in seq_along(keys)) sethash(h, keys[i], values[[i]])
h
}
make_env <- function(n, keys, values) setNames(values, keys) |> list2env(size = n)
get_mem <- function(x) as.numeric(lobstr::obj_size(x)) * 0.001 # object size in kB
compare <- function(n, keylen) {
keys <- stringi::stri_rand_strings(n, keylen)
values <- sample(list(mapply, iris, 1:1e6, "just a string", 1L), n, replace = TRUE)
ind <- sample(keys, 1)
h <- make_hash(n, keys, values)
e <- make_env(n, keys, values)
data.frame(
n = n,
method = c("environment", "hashtab"),
make_speed = {
bench::mark(
make_env(n, keys, values),
make_hash(n, keys, values),
check = FALSE
)$median |> as.character()
},
memory = c(get_mem(e), get_mem(h)),
access = bench::mark(
e[[ind]],
gethash(h, ind, NULL), # the natural h[[ind]] is 2x slower
iterations = 1e4
)$median |> as.character()
)
}
Some performance benchmarks,
purrr::map_dfr(c(1e2, 1e3, 1e4, 1e5, 1e6), compare, 10)
         n      method make_speed    memory access
1      100 environment     13.8µs     45.11  200ns
2      100     hashtab    139.9µs     41.61    1µs
3     1000 environment    110.5µs    227.56  200ns
4     1000     hashtab     1.25ms    214.34    1µs
5    10000 environment     1.46ms   2041.96  200ns
6    10000     hashtab    13.64ms   2044.10    1µs
7   100000 environment     53.3ms  20185.96  200ns
8   100000     hashtab    394.9ms  19719.11    1µs
9  1000000 environment       2.2s 201625.96  300ns
10 1000000     hashtab       4.1s 192799.18    1µs
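Part of the creation gap may come from the benchmark itself: make_hash fills the table one key at a time in an R-level loop, whereas make_env hands everything to list2env in a single call. A minimal sketch of a like-for-like environment constructor, under the hypothetical name make_env_loop, could look as follows. It is an assumption about what a fairer comparison might be, not a measured result.

```r
# Hypothetical helper: fill an environment one key at a time,
# mirroring the R-level loop used in make_hash().
make_env_loop <- function(n, keys, values) {
  e <- new.env(hash = TRUE, size = as.integer(n))
  for (i in seq_along(keys)) assign(keys[i], values[[i]], envir = e)
  e
}

# Small self-contained usage example with base R only.
keys <- sprintf("key%04d", 1:1000)
values <- as.list(seq_along(keys))
e <- make_env_loop(length(keys), keys, values)
```

Swapping make_env_loop in for make_env inside compare() would show how much of the difference is due to the loop rather than to the data structure.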
their reference semantics,
e1 <- new.env()
e1$hi <- 1
e2 <- e1
e2$hi <- 2
e1$hi # autocompletion
#> [1] 2
h1 <- hashtab()
sethash(h1, "hi", 1)
h2 <- h1
sethash(h2, "hi", 2)
gethash(h1, "hi")
#> [1] 2
batch access,
e1$bye <- 3
sethash(h1, "bye", 3)
eapply(e1, function(x) x)
#> $hi
#> [1] 2
#> $bye
#> [1] 3
(function(h) {
val <- list()
maphash(h, function(k, v) val[[k]] <<- v)
val
})(h1)
#> $bye
#> [1] 3
#> $hi
#> [1] 2
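The maphash() closure above relies on every key being usable as a list name, which only holds for string keys. A small helper that returns keys and values as two parallel lists avoids that assumption. hash_items is a hypothetical name, and the sketch assumes R >= 4.2.0, where hashtab, sethash, numhash and maphash are available in utils.

```r
# Hypothetical helper: collect all entries of a hashtab into two parallel lists.
# The order in which maphash() visits entries is not something to rely on.
hash_items <- function(h) {
  n <- utils::numhash(h)
  keys <- vector("list", n)
  values <- vector("list", n)
  i <- 0L
  utils::maphash(h, function(k, v) {
    i <<- i + 1L
    # Wrapping in list() keeps NULL and non-atomic entries intact.
    keys[i] <<- list(k)
    values[i] <<- list(v)
  })
  list(keys = keys, values = values)
}

h <- utils::hashtab()
utils::sethash(h, "hi", 2)
utils::sethash(h, 1:3, "an integer-vector key")
items <- hash_items(h)
str(items)
```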
key names,
e1[[iris]] <- 5 # error. arbitrary object as key... but why?
h1[[iris]] <- 5 # works fine
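One candidate use for arbitrary keys is caching results keyed by a function's full argument list, without first converting the arguments to a string. The following is only a sketch under that assumption: memoise_hashtab is a hypothetical name, and it requires R >= 4.2.0.

```r
# Hypothetical helper: memoise f, using the list of arguments itself as the key.
memoise_hashtab <- function(f) {
  cache <- utils::hashtab("identical")
  function(...) {
    key <- list(...)
    hit <- utils::gethash(cache, key, nomatch = NULL)
    # Values are stored wrapped in a list so that a cached NULL
    # can be told apart from a missing key.
    if (!is.null(hit)) return(hit[[1]])
    val <- f(...)
    utils::sethash(cache, key, list(val))
    val
  }
}

slow_nrow <- function(x) {
  Sys.sleep(0.1) # stand-in for an expensive computation
  nrow(x)
}
fast_nrow <- memoise_hashtab(slow_nrow)
fast_nrow(iris) # computed
fast_nrow(iris) # looked up in the hashtab
```

With an environment, the same cache would need a string key, for example a digest of the arguments.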
their internals, finding that environments contain a hashtab,
e <- new.env(size = 2)
e$x <- 5
.Internal(inspect(e))
#> @0x00000226b4083c48 04 ENVSXP g0c0 [REF(5)] <0x00000226b4083c48>
#> ENCLOS:
#> @0x00000226ac100778 04 ENVSXP g1c0 [MARK,REF(65535),GL,gp=0x8000] <R_GlobalEnv>
#> HASHTAB:
#> @0x00000226b5f6b588 19 VECSXP g0c2 [REF(1)] (len=2, tl=1)
#> @0x00000226b40b0a70 02 LISTSXP g0c0 [REF(1)]
#> TAG: @0x00000226aed32ae0 01 SYMSXP g1c0 [MARK,REF(65535)] "x"
#> @0x00000226b5f3d3a0 14 REALSXP g0c1 [REF(6)] (len=1, tl=0) 5
#> @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)]
# note the (len=8) in the output below, although size = 2 was requested
h <- hashtab(size = 2)
sethash(h, "x", 5)
.Internal(inspect(h))
#> @0x00000226b5f5e8e0 19 VECSXP g0c1 [OBJ,REF(9),ATT] (len=1, tl=0)
#> @0x00000226b4361ab0 22 EXTPTRSXP g0c0 [REF(3)] <0x00000226b4361ab0>
#> PROTECTED:
#> @0x00000226b5f4e898 19 VECSXP g0c4 [REF(1)] (len=8, tl=0)
#> @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)]
#> @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)]
#> @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)]
#> @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)]
#> @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)]
#> ...
#> TAG:
#> @0x00000226b235b2e8 13 INTSXP g0c2 [REF(1)] (len=3, tl=0) 1,0,3
#> ATTRIB:
#> @0x00000226b4361a78 02 LISTSXP g0c0 [REF(1)]
#> TAG: @0x00000226ac0ada80 01 SYMSXP g1c0 [MARK,REF(55126),LCK,gp=0x4000] "class" (has value)
#> @0x00000226b5f5e8a8 16 STRSXP g0c1 [REF(65535)] (len=1, tl=0)
#> @0x00000226b015bb00 09 CHARSXP g1c1 [MARK,REF(320),gp=0x61] [ASCII] [cached] "hashtab"
and finally, their behaviour in a toy package. The first access to the hashtab fails, subsequent accesses succeed.
# devtools::install_github("D-Se/so.hash")
so.hash:::data$hashtab
#> <hashtable (nil): count = 3, type = "identical">
so.hash::grab("x")
#> $env
#> [1] 1
#>
#> $hash
#> NULL
so.hash:::data$hashtab
#> <hashtable 0x00000168751ebcd0: count = 3, type = "identical">
so.hash::grab("x") # 2nd time asking
#> $env
#> [1] 1
#>
#> $hash
#> [1] 1
Comparing hashtab to an environment:
- memory use is similar,
- access time is of the same order (about 1µs versus 200ns per lookup),
- creation takes longer (because of my poor code?),
- elements can't be auto-completed (in RStudio),
- key names are more flexible,
- access to batches of data is rather cumbersome,
- documentation is minimal (it is still experimental),
- the size argument is not respected (..?),
- something is PROTECTED [1], but is not in new.env(),
- behavior is inconsistent.
[1] I don't know what this means.