
I have large objects in R that barely fit in my 16 GB of memory (a data.table database of >4M records, >400 variables).

I'd like to have a hash function that can be used to confirm that the database loaded into R has not been modified.

One fast way to do that is to calculate the database's hash and compare it with the previously stored hash.

The problem is that the digest::digest function copies (serializes) the data, and only after all the data are serialized does it calculate the hash. Which is too late on my hardware... :-(

Does anyone know about a way around this problem?

There is a poor man's solution: save the object to a file and calculate the hash of the file. But it introduces a large, unnecessary overhead (I have to make sure there is spare space on the HDD for yet another copy, and I need to keep track of all the files, which may not be automatically deleted).
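For concreteness, a minimal sketch of that workaround might look like this (a small hypothetical data.table stands in for the real database; the stored reference hash would have to be produced the same way):

library(digest)
library(data.table)

dt <- data.table(a = 1:10, b = letters[1:10])  # hypothetical stand-in for the big table

tmp <- tempfile(fileext = ".rds")
saveRDS(dt, tmp)                              # yet another full copy, now on disk
h <- digest(tmp, file = TRUE, algo = "md5")   # hash the file instead of the in-memory object
unlink(tmp)                                   # the temporary copy must be cleaned up manually
h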


Adam Ryczkowski
  • Maybe you can hash one column at a time, for instance `dt[,lapply(.SD,digest)]`. Then you check the hash of each column, or hash the result: `digest(dt[,lapply(.SD,digest)])`. – nicola Feb 13 '16 at 09:46
  • @nicola Thanks a lot. So simple and powerful! Works perfectly (a little enhancement is to call `gc()` after every call to digest to ensure that the unused memory is actually freed). – Adam Ryczkowski Feb 13 '16 at 15:16
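A minimal sketch of the column-wise approach suggested in the comments above (a small hypothetical data.table stands in for the real database; each per-column digest() still serializes that column, but one column at a time needs far less spare memory than the whole table):

library(data.table)
library(digest)

dt <- data.table(a = 1:10, b = letters[1:10])  # hypothetical stand-in for the big table

# Hash each column separately, calling gc() after each digest so the
# per-column serialization buffer is actually freed before the next column.
col_hashes <- dt[, lapply(.SD, function(col) { h <- digest(col); gc(); h })]

# A single fingerprint for the whole table: hash the per-column hashes.
fingerprint <- digest(col_hashes)
fingerprint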

2 Answers


A similar problem has been described in our issue tracker here: https://github.com/eddelbuettel/digest/issues/33

The current version of digest can read a file to compute the hash.

Therefore, at least on Linux, we can use a named pipe: the digest package reads from it in one process, while from the other side the data is written by another process.

The following code snippet shows how we can compute an MD5 hash of 10 numbers by feeding the digester first with 1:5 and then with 6:10.

library(parallel)
library(digest)

x <- as.character(1:10) # input

fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, "w")) # creates your pipe if does not exist

producer <- mcparallel({
    mystream <- file(fname, "w")
    writeLines(x[1:5], mystream)
    writeLines(x[6:10], mystream)
    close(mystream) # sends signal to the consumer (digester)
})

digester <- mcparallel({
    digest(fname, file = TRUE, algo = "md5") # just reads the stream till signalled
})

# runs both processes in parallel
mccollect(list(producer, digester))

unlink(fname) # named pipe removed

UPDATE: Henrik Bengtsson provided a modified example based on futures:

library("future")
plan(multiprocess)

x <- as.character(1:10) # input

fname <- "mystream.fifo" # choose name for your named pipe
close(fifo(fname, open="wb")) # creates your pipe if does not exists

producer %<-% {
    mystream <- file(fname, open="wb")
    writeBin(x[1:5], endian="little", con=mystream)
    writeBin(x[6:10], endian="little", con=mystream)
    close(mystream) # sends signal to the consumer (digester)
}

# just reads the stream till signalled
md5 <- digest::digest(fname, file = TRUE, algo = "md5")
print(md5)
## [1] "25867862802a623c16928216e2501a39"
# Note: Identical on Linux and Windows
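To connect this pattern back to the original question, one could stream the object's own serialization into the pipe while digest reads it from the other end. Below is a hedged sketch only (Linux, using parallel as above; `dt` is a small hypothetical stand-in for the large table): whether serialize() to a connection really avoids a second full in-memory copy depends on R's internals, and the result is a hash of the serialized stream, so the stored reference hash must be produced the same way with the same R serialization version.

library(parallel)
library(digest)
library(data.table)

dt <- data.table(a = 1:10, b = letters[1:10])  # hypothetical stand-in for the large table

fname <- "dt_stream.fifo"      # choose a name for your named pipe
close(fifo(fname, "w"))        # creates the pipe if it does not exist

producer <- mcparallel({
    con <- file(fname, open = "wb")
    serialize(dt, con)         # stream the object's serialization into the pipe
    close(con)                 # sends EOF to the consumer (digester)
})

digester <- mcparallel({
    digest(fname, file = TRUE, algo = "md5")  # reads the pipe until EOF
})

# runs both processes in parallel and collects the results
mccollect(list(producer, digester))

unlink(fname)                  # named pipe removed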
Viliam Simko

Following up on nicola's comment, here's a benchmark of the column-wise idea. It seems it doesn't help much, at least not in timing at these sizes: iris has 150 rows; long_iris has 3M (3,000,000).

library(microbenchmark)
library(dplyr) # needed for bind_rows()

# iris: 150 rows
nrow(iris)

microbenchmark(
  whole = digest::digest(iris),
  cols = digest::digest(lapply(iris, digest::digest))
)

# long_iris: 3,000,000 rows
long_iris = do.call(bind_rows, replicate(20e3, iris, simplify = FALSE))
nrow(long_iris)

microbenchmark(
  whole = digest::digest(long_iris),
  cols = digest::digest(lapply(long_iris, digest::digest))
)

Results:

# iris (150 rows)
Unit: milliseconds
  expr  min   lq mean median   uq  max neval cld
 whole 12.6 13.6 14.4   14.0 14.6 24.9   100   b
  cols 12.5 12.8 13.3   13.1 13.5 23.0   100  a 

# long_iris (3,000,000 rows)
Unit: milliseconds
  expr min  lq mean median  uq max neval cld
 whole 296 306  317    311 316 470   100   b
  cols 261 276  290    282 291 429   100  a 
CoderGuy123
  • I don't think the OP's main concern is execution time; rather, it is memory consumption. – Gregor Thomas Feb 21 '19 at 21:13
  • The two are related -- if the complete object needs to hit swap space, execution time becomes very slow. But I agree my benchmark did not capture the memory aspect directly; I don't know of any R benchmark tool that does that. – CoderGuy123 Feb 21 '19 at 21:18
  • Yes, related. I guess I just wish your answer contained a little context in the text about what you're measuring and how it's different. The OP's comment on column-wise hashing is *"So simple and powerful! Works perfectly"*, whereas yours is *"seems it doesn't help much."* I think this would be a better answer if you led with something like "column-wise hashing helps with memory consumption, but does not make any substantial difference in timing", to make it clear you're looking at a different metric. – Gregor Thomas Feb 21 '19 at 21:28
  • If you do want to track memory usage, [there's an answer for that](https://stackoverflow.com/a/7856328/903061). – Gregor Thomas Feb 21 '19 at 21:28
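For reference, the bench package (not used in the answer above) reports allocated memory alongside timings, so a memory-oriented comparison could be sketched roughly like this (with a smaller replication factor, just for illustration):

library(bench)
library(digest)

long_iris <- do.call(rbind, replicate(1000, iris, simplify = FALSE))  # 150,000 rows

# mem_alloc in the output shows how much memory each expression allocated;
# check = FALSE because the two expressions return different hash values.
bench::mark(
  whole = digest::digest(long_iris),
  cols  = digest::digest(lapply(long_iris, digest::digest)),
  check = FALSE
)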