
The fst package (http://www.fstpackage.org/fst/) offers multithreaded compression and fast reading and writing of data frames.

I'm running Bayesian models with brms that are large and slow to fit, and I want to save the results to disk for future re-use. With saveRDS() and compress = "xz" they come to ~200 MB on disk, but compressing takes forever and reading them back in and decompressing is also slow.
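For reference, this is the baseline I'm comparing against (fit stands in for one of my fitted brmsfit objects):

# current approach: xz-compressed RDS (small on disk, slow both ways)
saveRDS(fit, file = "fit.rds", compress = "xz")
fit2 <- readRDS("fit.rds")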

fst implements fast, multithreaded zstd compression, exposed directly through compress_fst() and decompress_fst().

library(fst)

# an example object mixing a data frame with a non-data-frame type
x <- list(
  a = mtcars,
  b = as.POSIXlt("2021-01-01 14:00:05"))

# serialize to a raw vector, zstd-compress it, and write it with saveRDS()
saveRDS(compress_fst(serialize(x, NULL), compression = 100),
  file = "test_fst.RDS")

# read the raw vector back, decompress, and unserialize
x2 <- unserialize(decompress_fst(readRDS("test_fst.RDS")))

all.equal(x, x2)

This returns TRUE, and the other quick tests I have done suggest that it all works fine.
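The compression is multithreaded by default; as I understand the fst API, the thread count can be inspected or changed with threads_fst():

threads_fst()    # with no arguments, returns the number of threads fst uses
threads_fst(8)   # limit fst (including compress_fst()) to 8 threads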

Am I missing any downsides or drawbacks to taking an arbitrary R object, serializing it, passing it to compress_fst() and then writing the compressed object to disk?
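For context, here is a minimal pair of helpers wrapping the pattern (the names save_zstd() and read_zstd() are my own; compress = FALSE just stops saveRDS() from gzip-compressing the already-compressed payload):

library(fst)

# save: serialize to a raw vector, zstd-compress, write as uncompressed RDS
save_zstd <- function(object, path, compression = 100) {
  raw_vec <- compress_fst(serialize(object, NULL), compression = compression)
  saveRDS(raw_vec, file = path, compress = FALSE)
}

# read: the reverse pipeline
read_zstd <- function(path) {
  unserialize(decompress_fst(readRDS(path)))
}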

Joshua
  • While I'm not familiar enough with how fst works (or with enough computer science) to give you even an educated guess, my experience makes me think there may be compatibility issues down the line when R moves to new major versions; the package may also change... It is technically possible to create a frozen environment (hello Docker), but it's something to keep in mind. – Roman Luštrik Sep 11 '21 at 06:44
  • @RomanLuštrik hmm good point, thank you! – Joshua Sep 11 '21 at 06:49
  • `read.fst()` crashes R with a core dump when run in the background on specific files corrupted by a previous `write.fst()`. This renders it unsuitable for large-scale automated ETL. Unfortunately the package author is not responding to any GitHub issues. Any thoughts? – Lazarus Thurston Nov 14 '22 at 20:52

0 Answers