1

I have around 200 thousand bz2 files, of which only one is valid. Each file is smaller than 200 bytes. I need to find the valid one, but the command-line bz2 utility is taking too much time.

Is there a minimal check on the file's bytes that would let me identify invalid bz2 files and skip further processing? I want to do this in C/C++, as it would be much faster than shell scripts.

Shashwat Kumar
  • C and C++ are distinct languages. There is no "C/C++". In any event, how do you imagine the command-line utility you're already using is implemented? In either C or C++, almost certainly. It is unlikely that the shell is adding much overhead. It is possible that a special-purpose tool would be somewhat more efficient, but unless yours is a frequently recurring task, the time and effort to develop such a tool is unlikely to be justified by the time saved running it. – John Bollinger Nov 24 '18 at 15:55
  • There is an informal bzip spec here: https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf – Niloct Nov 24 '18 at 16:07
  • "Minimal check" is not a reliable check. It can exclude files that are not bz2, but not detect if the files are *valid* bz2 files. That said, it would not be that much work to use a thread pool (for files to be checked), and [libbz2](ftp://sources.redhat.com/pub/bzip2/docs/manual_toc.html) to verify each file. It would be basically the same as running `bunzip2 -qt` and checking if its exit status is zero, but without the overhead of executing a separate process. – Nominal Animal Nov 24 '18 at 16:31

1 Answer

1

Got the solution. As per the bz2 format, the first 3 bytes of a file should be 'BZh'. This check filtered out all but 19 files.
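For anyone wanting to narrow the candidates further, here is a minimal sketch of a slightly stricter header check in C. It assumes the usual bzip2 layout: 'BZh', then a block-size digit '1' to '9', then either the 6-byte block magic 0x314159265359 or, for an empty stream, the end-of-stream magic 0x177245385090. The function names `bz2_header_plausible` and `bz2_file_header_plausible` are made up for this example. As noted in the comments, passing this check only means the file looks like bz2; it does not prove the data decompresses correctly.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the buffer begins like a bzip2 stream, 0 otherwise.
   This is a cheap filter only: it does not validate the compressed data. */
int bz2_header_plausible(const unsigned char *buf, size_t len)
{
    /* Magic that starts a compressed block (BCD digits of pi). */
    static const unsigned char block_magic[6] = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
    /* Magic that ends a stream; follows the header directly if the stream is empty. */
    static const unsigned char eos_magic[6] = {0x17, 0x72, 0x45, 0x38, 0x50, 0x90};

    if (len < 10)
        return 0;                         /* too short for header + magic */
    if (memcmp(buf, "BZh", 3) != 0)
        return 0;                         /* wrong signature */
    if (buf[3] < '1' || buf[3] > '9')
        return 0;                         /* invalid block-size digit */

    return memcmp(buf + 4, block_magic, 6) == 0 ||
           memcmp(buf + 4, eos_magic, 6) == 0;
}

/* Hypothetical helper: reads the first 10 bytes of a file and applies the check.
   Returns 0 if the file cannot be opened. */
int bz2_file_header_plausible(const char *path)
{
    unsigned char buf[10];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return 0;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return bz2_header_plausible(buf, n);
}
```

Calling `bz2_file_header_plausible(path)` on each file in a loop avoids spawning a process per file. Whatever survives the filter still needs a real test, for example `bunzip2 -qt` or libbz2, to confirm it is actually valid.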

Shashwat Kumar