read.fst() crashes R: workaround needed to detect a corrupted file before calling read.fst()

Question

This issue has already been reported on the fstpackage GitHub repository, but the package author no longer seems to be actively maintaining it.

Meanwhile, I need a workaround for this crash, which is reproducible and occurs regularly, though only on a small subset of my files. I am trying to find a way to detect a corrupted .fst file without actually reading it, because the crash halts all further processing in my script.

Here is a sample corrupted fst file that you can download and try to open with fst::read.fst(). Your R session is likely to crash. If it does not crash and you only get an error, you are lucky: I have tried on an Ubuntu R server as well as on macOS with the latest R 4.2, and the R session crashes every time. It may not crash in specific situations, but the question still stands. (For details of the error message, please see the GitHub issue linked above.)

I want some way to detect if a file is clean or corrupted before running read.fst().

And yes, I have tried tryCatch() but the crash still occurs.
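tryCatch() only intercepts R-level conditions, so it cannot stop a crash that happens inside compiled code. One possible workaround, sketched below under assumptions, is to attempt the read in a throwaway child process and look only at its exit status. The helper name `fst_is_readable` and the `timeout` default are hypothetical; only base R's `system2()`, `R.home()`, `shQuote()` and `normalizePath()` are used, plus `fst::read.fst()` inside the child.

```r
# Hypothetical helper: try the read in a separate Rscript process so that a
# crash there cannot take down the calling session.
fst_is_readable <- function(path, timeout = 60) {
  if (!file.exists(path)) return(FALSE)
  rscript <- file.path(R.home("bin"), "Rscript")
  # The child reads the whole file and discards the result.
  expr <- sprintf("invisible(fst::read.fst(%s))",
                  deparse(normalizePath(path)))
  status <- suppressWarnings(
    system2(rscript, c("-e", shQuote(expr)),
            stdout = FALSE, stderr = FALSE, timeout = timeout)
  )
  # 0 means a clean exit; anything else means an R error, a crash or a timeout.
  identical(as.integer(status), 0L)
}

# Usage: only read files that survived the probe.
# if (fst_is_readable("corrupted.fst")) df <- fst::read.fst("corrupted.fst")
```

This reads every file twice and pays the start-up cost of an R process per file, so it trades speed for safety. It also cannot detect a file that reads without crashing but contains garbage values. The callr package offers a similar kind of process isolation if an extra dependency is acceptable.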

Perhaps scanning the file header in raw (or octal) mode would help to detect unexpected bytes, such as null characters, that may be causing the crash. But I leave it to the experts to find a way.

UPDATE: Waldi has found that, surprisingly, a column-wise read.fst() does not crash. However, there are a few problems with this approach.

The column data is corrupted. In the file I tested, the last 2 of the 4 columns are corrupted. The output is as follows:
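As a starting point for that kind of inspection, the leading bytes can be pulled out with base R's `readBin()` without involving any fst code. The sketch below makes no claim about what a valid fst header looks like; it only reports the file size and the first bytes, and flags two obvious red flags (an empty file, or a header made up entirely of null bytes). The helper name `inspect_header` and the 64-byte window are assumptions.

```r
# Hypothetical helper: cheap sanity checks on the raw bytes of a file.
inspect_header <- function(path, n = 64L) {
  size <- file.size(path)
  empty <- is.na(size) || size == 0
  bytes <- if (empty) raw(0) else readBin(path, what = "raw", n = n)
  list(
    size     = size,
    empty    = empty,
    header   = bytes,
    # TRUE only if at least one byte was read and every byte is 0x00
    all_zero = length(bytes) > 0 && all(bytes == as.raw(0))
  )
}

# Usage: compare the header of a suspect file with that of a known-good one.
# inspect_header("corrupted.fst")$header
# inspect_header("good.fst")$header
```

Comparing the printed bytes of a few good and bad files side by side may reveal whether the damage is visible in the header at all; if it sits deeper in the file, this check will not catch it.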

> fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10)
       termid                  ts            rv            av
1  1204011660                <NA> 4.646816e-310 4.646816e-310
2  1204011660 2022-07-21 07:52:43 4.646816e-310 4.646816e-310
3  1204011660 2022-08-18 16:37:19 4.646816e-310 4.646816e-310
4  1204011660 2022-08-18 16:37:20 4.646835e-310 4.646835e-310
5  1204011660 2022-08-18 16:37:50 4.646817e-310 4.646817e-310
6  1204011660 2022-08-18 16:38:13 4.646817e-310 4.646817e-310
7  1204011660 2022-08-18 16:38:43 4.646817e-310 4.646817e-310
8  1204011660 2022-08-18 16:39:13 4.646817e-310 4.646817e-310
9  1204011660 2022-08-18 16:39:15 4.646819e-310 4.646819e-310
10 1204011660 2022-08-18 16:39:45 4.646830e-310 4.646830e-310

The response time tanks: it takes almost 6 seconds just to return the first 10 rows.

> system.time(fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10))
   user  system elapsed 
  2.069   3.874   5.940

A column-wise read still crashes R on other corrupted files I have, so it is not a reliable method.

I am looking for a faster and more reliable solution.
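For files where the column-wise read does return, the garbage in rv and av has a recognisable shape: values around 4.6e-310 are subnormal doubles, far below anything a real reading is likely to produce. A heuristic check for that pattern might look like the sketch below. The helper name `has_subnormals` is hypothetical, and the assumption that genuine data never contains subnormal values is mine.

```r
# Hypothetical heuristic: flag doubles that are non-zero yet smaller in
# magnitude than the smallest normalised double (about 2.2e-308).
has_subnormals <- function(x) {
  if (!is.double(x)) return(FALSE)
  ax <- abs(x[is.finite(x)])
  any(ax > 0 & ax < .Machine$double.xmin)
}

# Usage on a data frame returned by a column-wise read:
# bad_cols <- names(df)[vapply(df, has_subnormals, logical(1))]
```

Since the column-wise read can itself crash on some files, this can only serve as a secondary check on data that has already been loaded.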

I just commented at GitHub. You may have better luck if you create a self-contained example crashing on data it wrote. — Dirk Eddelbuettel, Nov 15 '22 at 19:19
Thanks, I have added my comment on GitHub: https://github.com/fstpackage/fst/issues/271#issuecomment-1316466656 — Lazarus Thurston, Nov 16 '22 at 06:56
@sindri_baldur - I have already built considerable software that creates small fst files for every asset under monitoring. It's an IoT system I have built. How much learning time do you think PostgreSQL would take? — Lazarus Thurston, Nov 16 '22 at 16:37
Thanks, I will give fst one last try and then move to MySQL, with which I am more comfortable. — Lazarus Thurston, Nov 16 '22 at 19:09
Surprisingly, all columns of the corrupted file (`ts,rv,termid,av`) can be read individually by using the `columns` argument, e.g. `fst::read.fst('corrupted.fst',columns =c('termid'))` — Waldi, Nov 16 '22 at 19:15
