0

I have a folder (actually on HDFS, but I don't think that affects the question) of .bz2 files. Some of these, when decompressed, yield a single empty file. I would like to remove all the .bz2 files that decompress to empty, and I notice that they are all 14 bytes in size. Is it safe to simply remove all 14-byte files? Or is it possible for a non-empty file to compress to (or decompress from) a 14-byte bz2?

tex94

2 Answers

0

BZ2 is the compressed file format used by bzip2, a free and open-source compression program created by Julian Seward. bzip2 uses the Burrows-Wheeler transform combined with run-length encoding (RLE) and Huffman coding.

If you want to delete those files, first use the commands below to inspect the contents of a .bz2 file.

To decompress a local .bz2 file and write the result to HDFS:

bunzip2 -c test.bz2 | hadoop fs -put - /path/filepath

To read the contents of a .bz2 file already on HDFS and store the decompressed text:

hadoop fs -text /path_for_hdfs/test.bz2 | hadoop fs -put - /hdfs_path/abc.txt
jose praveen
  • I don't really want to have to unzip all the files though...the question is can I guarantee a 14 byte bz2 is empty? – tex94 Jul 25 '17 at 12:40
0

I created an empty text file, compressed it with bzip2, and put it into HDFS. The compressed empty file was 14 bytes. When I did the same with a non-empty file (containing only one character), the result was 39 bytes.
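The experiment above can also be reproduced locally without HDFS. This is only a minimal sketch using Python's standard `bz2` module instead of the `bzip2` command line; the variable names are made up for illustration, and the exact size of the non-empty case depends on the input (for example, whether the file ends with a newline), so it may not match the 39 bytes reported above.

```python
import bz2

# Compress an empty payload and a one-byte payload entirely in memory.
empty = bz2.compress(b"")
one_char = bz2.compress(b"a")

print(len(empty))     # size of the compressed empty input
print(len(one_char))  # size of the compressed one-byte input
print(empty.hex())    # raw bytes of the compressed empty input
```

The empty input comes out at 14 bytes, and the one-byte input is noticeably larger.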

My conclusion is that every 14-byte bzip2 file will be empty.

Make your own decision based on test cases...
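For a less empirical argument, it may help to look at how the 14 bytes are laid out. As far as I understand the bzip2 format, a stream starts with a 4-byte header (`BZh` plus a level digit `1` to `9`) and ends with a 10-byte footer (the 6-byte end-of-stream magic `0x177245385090` plus a 4-byte combined CRC). That is already 14 bytes with no data block at all. Any compressed block adds at least its own 6-byte block magic and 4-byte CRC, so a valid stream carrying data cannot fit in 14 bytes. The sketch below checks those bytes directly; `is_empty_bz2` is a hypothetical helper name, and fetching the bytes from HDFS is left out.

```python
import bz2

STREAM_MAGIC = b"BZh"                                # start of every bzip2 stream
END_OF_STREAM_MAGIC = bytes.fromhex("177245385090")  # footer marker
EMPTY_CRC = b"\x00\x00\x00\x00"                      # combined CRC when there are no blocks


def is_empty_bz2(data: bytes) -> bool:
    """Return True if data is exactly the 14-byte encoding of an empty stream.

    Hypothetical helper for illustration: it inspects the bytes and does
    not decompress anything.
    """
    if len(data) != 14:
        return False
    level = data[3:4]  # block-size digit, b"1" .. b"9"
    return (
        data[:3] == STREAM_MAGIC
        and level in b"123456789"
        and data[4:10] == END_OF_STREAM_MAGIC
        and data[10:] == EMPTY_CRC
    )


print(is_empty_bz2(bz2.compress(b"")))   # empty input
print(is_empty_bz2(bz2.compress(b"a")))  # one-byte input
```

Note that this only covers valid bzip2 streams. A truncated or corrupt file that happens to be 14 bytes long would fail the byte check, which is one reason to compare the contents rather than rely on the size alone.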


Rahul
  • But can anybody explain WHY a 14 byte bz2 file will always decompress to an empty file? I.e. Your answer seems to be based on inductive reasoning but where potential data loss is concerned I would feel more secure with an answer based on deductive reasoning. – tex94 Jul 25 '17 at 12:55
  • "WHY a 14 byte bz2 file will always decompress to an empty file." An empty file compressed with bzip2 will be 14 bytes, but I don't know the exact reason for that size. – Rahul Jul 25 '17 at 12:59
  • I accept that, but might a nonempty file also compress to 14 bytes for some reason? It is not enough to show that a single character compresses to more than 14 bytes as a compression algorithm could easily be postulated that compresses say 8, 16 or 32 copies of a character to less space than a single character. I am hoping that someone with domain specific knowledge of bz2 can supply an answer... – tex94 Jul 25 '17 at 13:18