
tl;dr R-core rejects many feature requests because of the maintenance burden, but hashtab (R >= 4.2.0) was accepted. ?hashtab claims to efficiently associate keys with values. Many other implementations (hash, r2r, hashmap, ...) exist, as do environments and user-friendly extensions to them (rlang, RC, R6, ...). Other than object obfuscation and arbitrary keys, I have not found an obvious use case where hashtab is more efficient than the alternatives.

Question
Does hashtab have a distinct feature other than arbitrary keys, or tangible benefit in speed, memory or syntax for some use case?

I have tried to look at the performance, features and internals of environments and a hashtab.

set.seed(1)
make_hash <- function(n, keys, values) {
    h <- hashtab("identical", n)
    for(i in seq_along(keys)) sethash(h, keys[i], values[[i]])
    h
}
make_env <- function(n, keys, values) setNames(values, keys) |> list2env(size = n)

get_mem <- function(x) as.numeric(lobstr::obj_size(x)) * 0.001 # object size in kB
compare <- function(n, keylen) {
    keys <- stringi::stri_rand_strings(n, keylen) 
    values <- sample(list(mapply, iris, 1:1e6, "just a string", 1L), n, replace = TRUE)
    ind <- sample(keys, 1)
    h <- make_hash(n, keys, values)
    e <- make_env(n, keys, values)
    data.frame(
        n = n,
        method = c("environment", "hashtab"),
        make_speed = {
            bench::mark(
                make_env(n, keys, values),
                make_hash(n, keys, values),
                check = FALSE
            )$median |> as.character()
        },
        memory = c(get_mem(e), get_mem(h)),
        access = bench::mark(
            e[[ind]],
            gethash(h, ind, NULL), # the natural h[[ind]] is 2x slower
            iterations = 1e4
        )$median |> as.character()
    )
}

Some performance benchmarks,

purrr::map_dfr(c(1e2, 1e3, 1e4, 1e5, 1e6), compare, 10)
         n      method make_speed    memory access
1      100 environment     13.8µs     45.11  200ns
2      100     hashtab    139.9µs     41.61    1µs
3     1000 environment    110.5µs    227.56  200ns
4     1000     hashtab     1.25ms    214.34    1µs
5    10000 environment     1.46ms   2041.96  200ns
6    10000     hashtab    13.64ms   2044.10    1µs
7   100000 environment     53.3ms  20185.96  200ns
8   100000     hashtab    394.9ms  19719.11    1µs
9  1000000 environment       2.2s 201625.96  300ns
10 1000000     hashtab       4.1s 192799.18    1µs

their reference semantics,

e1 <- new.env()
e1$hi <- 1
e2 <- e1
e2$hi <- 2
e1$hi # autocompletion
#> [1] 2

h1 <- hashtab()
sethash(h1, "hi", 1)
h2 <- h1
sethash(h2, "hi", 2)
gethash(h1, "hi")
#> [1] 2
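
Because both objects are references, a function can modify either one in place without returning it. A minimal sketch of that, where `bump_env` and `bump_hash` are hypothetical helper names used only for illustration:

```r
# Hypothetical helpers: each increments the value stored under `key`
# and returns nothing, relying on reference semantics alone.
bump_env <- function(e, key) {
    e[[key]] <- e[[key]] + 1
    invisible(NULL)
}
bump_hash <- function(h, key) {
    sethash(h, key, gethash(h, key) + 1)
    invisible(NULL)
}

e <- new.env()
e$n <- 0
h <- hashtab()
sethash(h, "n", 0)

bump_env(e, "n")
bump_hash(h, "n")

e$n
#> [1] 1
gethash(h, "n")
#> [1] 1
```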

batch access,

e1$bye <- 3
sethash(h1, "bye", 3)

eapply(e1, function(x) x)
#> $hi
#> [1] 2
#> $bye
#> [1] 3
(function(h) {
    val <- list()
    maphash(h, function(k, v) val[[k]] <<- v)
    val
})(h1)
#> $bye
#> [1] 3
#> $hi
#> [1] 2
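
The closure above only works while every key is a string. A rough sketch of a more general collector, assuming keys can be any R object: `hash_items` is a hypothetical helper (not part of utils) that returns keys and values as two parallel lists, in whatever order maphash visits them.

```r
# Hypothetical helper: collect all entries of a hashtab into parallel lists.
hash_items <- function(h) {
    n <- numhash(h)                 # number of entries currently stored
    keys <- vector("list", n)
    vals <- vector("list", n)
    i <- 0L
    maphash(h, function(k, v) {
        i <<- i + 1L
        # assign via list() so that a NULL value does not drop the slot
        keys[i] <<- list(k)
        vals[i] <<- list(v)
    })
    list(keys = keys, values = vals)
}

h <- hashtab()
sethash(h, "hi", 2)
sethash(h, 1:3, "an integer vector as key")
items <- hash_items(h)
length(items$keys)
#> [1] 2
```

The iteration order is not something I would rely on, so the sketch returns the keys alongside the values instead of assuming a position.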

key names,

e1[[iris]] <- 5 # error. arbitrary object as key... but why?
h1[[iris]] <- 5 # works fine
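
A small sketch of what arbitrary keys look like in practice, assuming the default type = "identical", where a lookup matches only if the key is identical() to a stored key:

```r
h <- hashtab()                       # type = "identical" by default
sethash(h, c(1L, 2L), "int pair")    # integer vector as key
sethash(h, list(a = 1), "a list")    # list as key

gethash(h, c(1L, 2L))
#> [1] "int pair"

# A double vector is not identical() to the integer key, so this misses
# and returns the nomatch value.
gethash(h, c(1, 2), nomatch = NA)
#> [1] NA

numhash(h)
#> [1] 2
```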

their internals (explanation), finding that environments contain a hashtab,

e <- new.env(size = 2)
e$x <- 5
.Internal(inspect(e))
#> @0x00000226b4083c48 04 ENVSXP g0c0 [REF(5)] <0x00000226b4083c48>
#> ENCLOS:
#>  @0x00000226ac100778 04 ENVSXP g1c0 [MARK,REF(65535),GL,gp=0x8000] <R_GlobalEnv>
#> HASHTAB:
#>   @0x00000226b5f6b588 19 VECSXP g0c2 [REF(1)] (len=2, tl=1)
#>     @0x00000226b40b0a70 02 LISTSXP g0c0 [REF(1)] 
#>       TAG: @0x00000226aed32ae0 01 SYMSXP g1c0 [MARK,REF(65535)] "x"
#>       @0x00000226b5f3d3a0 14 REALSXP g0c1 [REF(6)] (len=1, tl=0) 5
#>     @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)] 

#  note the (len=8)
h <- hashtab(size = 2)
sethash(h, "x", 5)
.Internal(inspect(h))
#> @0x00000226b5f5e8e0 19 VECSXP g0c1 [OBJ,REF(9),ATT] (len=1, tl=0)
#>   @0x00000226b4361ab0 22 EXTPTRSXP g0c0 [REF(3)] <0x00000226b4361ab0>
#>   PROTECTED:
#>     @0x00000226b5f4e898 19 VECSXP g0c4 [REF(1)] (len=8, tl=0)
#>       @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)] 
#>       @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)] 
#>       @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)] 
#>       @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)] 
#>       @0x00000226ac0add90 00 NILSXP g1c0 [MARK,REF(65535)] 
#>       ...
#>   TAG:
#>     @0x00000226b235b2e8 13 INTSXP g0c2 [REF(1)] (len=3, tl=0) 1,0,3
#> ATTRIB:
#>   @0x00000226b4361a78 02 LISTSXP g0c0 [REF(1)] 
#>     TAG: @0x00000226ac0ada80 01 SYMSXP g1c0 [MARK,REF(55126),LCK,gp=0x4000] "class" (has value)
#>     @0x00000226b5f5e8a8 16 STRSXP g0c1 [REF(65535)] (len=1, tl=0)
#>       @0x00000226b015bb00 09 CHARSXP g1c1 [MARK,REF(320),gp=0x61] [ASCII] [cached] "hashtab"

and finally, their behaviour in a toy package. The first access to the hashtab fails; subsequent accesses succeed.

# devtools::install_github("D-Se/so.hash")
so.hash:::data$hashtab
#> <hashtable (nil): count = 3, type = "identical">
so.hash::grab("x")
#> $env
#> [1] 1
#> 
#> $hash
#> NULL

so.hash:::data$hashtab
#> <hashtable 0x00000168751ebcd0: count = 3, type = "identical">
so.hash::grab("x") # 2nd time asking
#> $env
#> [1] 1
#> 
#> $hash
#> [1] 1

Comparing a hashtab to an environment:

  • memory use is similar,
  • access time is similar (about 1µs versus 200-300ns),
  • creation takes longer (because of my poor code?),
  • elements can't be auto-completed (in RStudio),
  • key names are more flexible,
  • access to batches of data is rather cumbersome,
  • documentation is minimal (it is still experimental),
  • the size argument is not respected (..?),
  • something is PROTECTED [1], which is not the case for new.env(),
  • behavior is inconsistent (the first access in a package fails).

[1] I don't know what this means.

Donald Seinen
  • Very nicely written question, and a very interesting one too, but could be bordering on opinion-based. What kind of answer are you looking for here? Is it a "killer app" where something can be done with hashtabs that can't be done (or can't be done as easily) with environments? – Allan Cameron Aug 06 '22 at 09:55
  • @AllanCameron Exactly, either tangible improvement (speed or memory) through a benchmark or a feature I am currently not aware of that makes it different from the implementations mentioned – Donald Seinen Aug 06 '22 at 10:06
  • While the NEWS file flags hashtab as experimental, if it settles in then simply being part of base R is an advantage. Having R-Core voluntarily take on the maintenance burden is a strong indicator of future reliability precisely because they are picky about what features they agree to support (and because they have a long history of supporting those that they've agreed to). Not having third-party package dependencies is an advantage. – Gregor Thomas Aug 06 '22 at 17:53
  • Perhaps when you think you want a .dcf? – Chris Aug 06 '22 at 17:54
