
If I have two text files, one and two, what's the difference between:

bzip2 one two -c >out.bz2

...and...

cat one two | bzip2 -c >out.bz2

?

Specifically, I'm generating bz2 files using pbzip2, putting them on HDFS, and then reading them from Pig, and I'm hitting MAPREDUCE-477. I can't upgrade my Hadoop cluster from version 0.20, using a non-parallel bz2 implementation is too slow, and I want to use a non-block compression algorithm.

Is there any way I can convert a concatenated bz2 file into a non-concatenated one? Or even, how would I modify pbzip2 so it generates non-concatenated bz2 files?
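The obvious conversion would be to decompress and recompress as a single stream, but that serializes on the non-parallel bzip2 I'm trying to avoid. A sketch of what I mean, with made-up file names:

```shell
# Simulate what pbzip2 produces: one bz2 stream per input chunk,
# concatenated back-to-back in a single file.
printf 'hello\n' > one
printf 'world\n' > two
bzip2 -c one  > multi.bz2
bzip2 -c two >> multi.bz2

# bzcat reads every stream in sequence, so piping it back through a
# single bzip2 process yields one stream -- correct, but single-threaded.
bzcat multi.bz2 | bzip2 -c > single.bz2
```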

Thanks -

andrewdotn
Nicholas White

1 Answer


Often compression works by replacing patterns with something shorter. For example, if you have "Hello there, goodbye there" then you might replace the second "there" with a reference to the first (where the reference is smaller than the original 5 bytes).

Now imagine you have 2 files, one that contains "Hello there" and another that contains "Goodbye there". If you concatenate first and then compress, the compressor has more data to work with and can replace the second "there" with a reference to the first. If you compress both files separately and then concatenate, this can't happen.
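You can see a hint of this with bzip2 itself: compressing the concatenation produces one stream, while compressing separately pays the per-stream overhead (headers, CRCs) twice. Exact sizes vary by version; this is just illustrative:

```shell
printf 'Hello there\n'   > one
printf 'Goodbye there\n' > two

# Compress each file separately, then concatenate the results:
# two independent bz2 streams, each with its own header and CRC.
bzip2 -c one  > separate.bz2
bzip2 -c two >> separate.bz2

# Concatenate first, then compress: a single stream that can reuse
# context from the first file while encoding the second.
cat one two | bzip2 -c > together.bz2

wc -c separate.bz2 together.bz2
```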

Now imagine you concatenate then compress, so that the second "there" (from the second file) is replaced with a reference to the first "there" (from the first file), and then try to split the compressed data back into 2 compressed files. You'd end up with two files where the second contains a reference to something that doesn't exist in that file, so it can't be decompressed.

Note: Modern compression techniques are a lot more complex than what I described above - I oversimplified a lot to illustrate.

If you need to compress and decompress a large amount of data in parallel, it can't be done as a single monolithic stream. Instead you need to split the data into small pieces, so that each piece can be compressed/decompressed separately and many pieces can be processed in parallel.
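A sketch of that split-then-compress idea using plain tools (this is effectively what pbzip2 does internally, which is why its output is a sequence of independent streams):

```shell
# Generate some sample input.
seq 1 200000 > big-input

# Split into fixed-size pieces and compress each piece as its own
# stream (pbzip2 would run these compressions in parallel).
split -b 1m big-input piece.
for p in piece.*; do bzip2 -c "$p"; done > out.bz2

# Decompression reads every stream in turn and recovers the original.
bzcat out.bz2 | cmp - big-input && echo "round-trip OK"
```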

Brendan
  • Yes, but bz2 is a block compression algorithm, so there shouldn't be dependencies between blocks? They might have different dictionaries, but I'm not sure how that results in some applications (e.g. MAPREDUCE-477) only reading the blocks of the first file? – Nicholas White Feb 06 '13 at 01:41
  • If a block is 1000 bytes and the first file is 1300 bytes and the second file is 1700 bytes, then guess what the block in the middle is going to contain when the files are concatenated and then compressed.. – Brendan Feb 06 '13 at 01:43