
This issue has already been opened on the fst package's GitHub, but the package author no longer seems to be actively maintaining it.

Meanwhile, I need a workaround for this crash, which is reproducible and occurs regularly, but only on a small subset of my files. I am trying to find a way to detect a corrupted .fst file without actually reading it, because the crash stops all further processing in my script.

Here is a sample corrupted fst file that you can download and try to open with fst::read.fst(). Your R session is likely to crash. If it does not crash and you just get an error, you are lucky: I have tried on an Ubuntu R server as well as on macOS with the latest R 4.2, and the R session crashes every time. It may not crash in specific situations, but the question still remains. (For details of the error message, please see the GitHub issue linked above.)

I want some way to detect whether a file is clean or corrupted before running read.fst().

And yes, I have tried tryCatch() but the crash still occurs.
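As far as I understand, tryCatch() only intercepts R-level conditions, so it cannot help when the whole process dies. One fallback I can imagine, sketched below, is to attempt the read in a separate Rscript process and look at its exit status. The helper names runs_cleanly() and fst_reads_cleanly() are hypothetical, the 60-second timeout is an arbitrary assumption, and this still reads the file once (in a disposable process), so it is a containment measure rather than the read-free detection I am really after.

```r
# Hypothetical helper: evaluate an R expression in a fresh Rscript process
# and report whether that child process exited with status 0.
runs_cleanly <- function(expr_text, timeout = 60) {
  rscript <- file.path(R.home("bin"), "Rscript")
  status <- suppressWarnings(
    system2(rscript,
            c("--vanilla", "-e", shQuote(expr_text)),
            stdout = FALSE, stderr = FALSE,  # discard child output
            timeout = timeout)               # assumption: 60 s is enough
  )
  # A crash, an R error, or a timeout in the child gives a non-zero status.
  identical(as.integer(status), 0L)
}

# Hypothetical wrapper for the fst case: the risky read happens only in
# the child, so a crash there does not take down the calling session.
fst_reads_cleanly <- function(path, timeout = 60) {
  path <- normalizePath(path, mustWork = TRUE)
  expr <- sprintf("invisible(fst::read.fst(%s))", deparse(path))
  runs_cleanly(expr, timeout)
}
```

In a batch script this would be used as `if (fst_reads_cleanly(f)) df <- fst::read.fst(f)`, at the cost of reading every good file twice and starting one extra R process per file.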

Perhaps scanning the file's header as raw bytes (octal or hex) could help detect unexpected content, such as the null characters that may be causing the crash. But I leave it to the experts to find a way.
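To make the raw-scan idea concrete, here is a minimal sketch using only base R. It does not rely on any knowledge of the fst file layout (which I do not have), and the helper names peek_bytes() and header_summary() as well as the 64-byte window are my own assumptions. Since fst is a binary format, null bytes may be perfectly normal in a healthy header, so the output is only useful for comparing a known-good file against a corrupted one.

```r
# Hypothetical helper: read the first n bytes of a file as raw values,
# without involving the fst package at all.
peek_bytes <- function(path, n = 64L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readBin(con, what = "raw", n = n)
}

# Summarise the leading bytes: how many were read, how many are null
# (0x00), and a hex dump for eyeballing differences between files.
header_summary <- function(path, n = 64L) {
  bytes <- peek_bytes(path, n)
  list(
    n_read = length(bytes),
    n_null = sum(bytes == as.raw(0)),
    hex    = paste(as.character(bytes), collapse = " ")
  )
}
```

Calling `header_summary("good.fst")` and `header_summary("corrupted.fst")` side by side (both file names are placeholders) would show whether the first bytes differ at all between the two.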

UPDATE: Waldi found that, surprisingly, a column-wise read.fst() does not crash. However, this approach has a few problems.

  1. The column data is corrupted. In the file I tested, the last 2 of the 4 columns are corrupted. The output is as follows:
> fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10)
       termid                  ts            rv            av
1  1204011660                <NA> 4.646816e-310 4.646816e-310
2  1204011660 2022-07-21 07:52:43 4.646816e-310 4.646816e-310
3  1204011660 2022-08-18 16:37:19 4.646816e-310 4.646816e-310
4  1204011660 2022-08-18 16:37:20 4.646835e-310 4.646835e-310
5  1204011660 2022-08-18 16:37:50 4.646817e-310 4.646817e-310
6  1204011660 2022-08-18 16:38:13 4.646817e-310 4.646817e-310
7  1204011660 2022-08-18 16:38:43 4.646817e-310 4.646817e-310
8  1204011660 2022-08-18 16:39:13 4.646817e-310 4.646817e-310
9  1204011660 2022-08-18 16:39:15 4.646819e-310 4.646819e-310
10 1204011660 2022-08-18 16:39:45 4.646830e-310 4.646830e-310

  2. The response time tanks: it takes nearly 6 seconds (elapsed) just to output the first 10 rows.
> system.time(fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10))
   user  system elapsed 
  2.069   3.874   5.940 
  3. A column-wise read still crashes on other corrupted files I have, so it is not a reliable method.
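For the files where the column-wise read does return, the garbage in the first point has a recognisable shape: values around 4.6e-310 are smaller than .Machine$double.xmin, i.e. denormal doubles, which is what arbitrary bytes often look like when reinterpreted as numbers. A rough heuristic built on that observation is sketched below. The function names is_denormal() and suspicious_columns() are hypothetical, and since genuine data can legitimately contain denormals, a hit is a warning sign rather than proof of corruption.

```r
# Flag non-zero values whose magnitude is below the smallest normalised
# double. NA values are never flagged.
is_denormal <- function(x) {
  !is.na(x) & x != 0 & abs(x) < .Machine$double.xmin
}

# Hypothetical check: for each plain double column of a data frame, report
# whether it contains at least one denormal value. Classed columns such as
# POSIXct timestamps and integer columns are skipped.
suspicious_columns <- function(df) {
  plain_double <- vapply(
    df,
    function(col) is.double(col) && !is.object(col),
    logical(1)
  )
  vapply(df[plain_double], function(col) any(is_denormal(col)), logical(1))
}
```

Applied to the 10 rows shown above, this should flag rv and av but not termid, assuming those columns are stored as plain doubles.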

I am looking for a faster and more reliable solution.

Lazarus Thurston