Piecemeal bzcompression for large files in PHP

Question

Creating bzip2-compressed data in PHP is very easy thanks to bzcompress. In my present application, however, I cannot reasonably read the whole input file into a string and then call bzcompress or bzwrite. The PHP documentation does not make it clear whether successive calls to bzwrite with relatively small amounts of data yield the same result as compressing the whole file in one go. By compressing in one go I mean something along the lines of

$data = file_get_contents('/path/to/bigfile');
$cdata = bzcompress($data);

I tried out piecemeal bzcompression using the routines shown below:

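function makeBZFile($infile,$outfile)
{
 $fp = fopen($infile,'rb');
 $bz = bzopen($outfile,'w');
 while (!feof($fp))
 {
  // Feed the compressor 10240 bytes at a time
  $bytes = fread($fp,10240);
  if ($bytes === false) break;
  bzwrite($bz,$bytes);
 }
 bzclose($bz); // flushes whatever the compressor still has buffered
 fclose($fp);
}

function unmakeBZFile($infile,$outfile)
{
 $bz = bzopen($infile,'r');
 $fp = fopen($outfile,'wb'); // truncate an existing output file instead of appending to it
 while (!feof($bz))
 {
  // bzread returns at most the requested number of uncompressed bytes
  $str = bzread($bz,10240);
  if ($str === false) break;
  fwrite($fp,$str);
 }
 fclose($fp);
 bzclose($bz);
}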

set_time_limit(1200);
makeBZFile('/tmp/test.rnd','/tmp/test.bz');
unmakeBZFile('/tmp/test.bz','/tmp/btest.rnd');

To test this code I did two things:

I used makeBZFile and unmakeBZFile to compress and then decompress a SQLite database, which is what I need to do eventually.
I created a 50 MB file filled with random data: dd if=/dev/urandom of=/tmp/test.rnd bs=50M count=1

In both cases I performed a diff original.file decompressed.file and found that the two were identical.

All very nice, but it is not clear to me why this works. The PHP docs state that bzread(bzpointer,length) reads a maximum of length bytes of UNCOMPRESSED data. If my code above is working, it is presumably because I am forcing the bzwrite and bzread size to 10240 bytes.

What I cannot see is just how bzread knows how to fetch length bytes of UNCOMPRESSED data. I checked out the format of a bzip2 file. I cannot see that there is anything there which helps easily establish the uncompressed data length for a chunk of the .bz file.

I suspect there is a gap in my understanding of how this works, or else the fact that my code above appears to perform a correct piecemeal compression is purely accidental.

I'd much appreciate a few explanations here.

score 3 · Accepted Answer · answered Dec 14 '15 at 11:11

To understand how decompression determines the number of bytes, you first have to understand the compression. It seems that you are not familiar with the compression algorithm.

BZIP2

The central algorithm of BZIP2 is the Burrows-Wheeler transform (BWT), which converts the original data into a form suitable for the subsequent coding stage. The current version applies a Huffman code. The compression algorithm processes the data in blocks that are completely independent of one another. The block size can be set in steps from 1 to 9 (100,000 to 900,000 bytes).
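As a rough illustration of why the size of each bzwrite call does not matter, the sketch below feeds the same data to bzwrite in pieces of different sizes and checks that every resulting file decompresses back to the original. It assumes the bz2 extension is loaded; the helper name writeInPieces, the sample text and the piece sizes are made up for this example and are not part of the question's code.

```php
<?php
// Hypothetical helper: compress $data into $path, handing it to bzwrite
// in pieces of $pieceSize bytes.
function writeInPieces(string $data, string $path, int $pieceSize): void
{
    $bz = bzopen($path, 'w');
    for ($offset = 0; $offset < strlen($data); $offset += $pieceSize) {
        bzwrite($bz, substr($data, $offset, $pieceSize));
    }
    bzclose($bz); // flushes the last, partially filled block
}

// 45 bytes x 25000 = 1,125,000 bytes, i.e. more than the largest
// block size of 900,000 bytes.
$data = str_repeat("The quick brown fox jumps over the lazy dog.\n", 25000);

$results = [];
foreach ([512, 10240, 1000000] as $pieceSize) {
    $path = tempnam(sys_get_temp_dir(), 'bz');
    writeInPieces($data, $path, $pieceSize);
    // Decompress the whole file in one call and compare with the input.
    $results[$pieceSize] = (bzdecompress(file_get_contents($path)) === $data);
    unlink($path);
}

var_dump($results);
```

The pieces handed to bzwrite are only a feeding granularity: the library collects input until a block is full, so the caller's piece size and the block size are unrelated.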

BZIP2 Data Structure

The compressed stream starts with the two characters 'BZ', followed by one byte identifying the algorithm used ('h' for Huffman coding). The block size identifier follows immediately and is valid for the entire file, so the header reads BZh1, BZh2, BZh3 and so on up to BZh9. The digit gives the block size in units of 100,000 bytes (1-9, i.e. 100,000 to 900,000 bytes).

The original data is stored in blocks of the selected size, and each block is protected individually by a CRC32 checksum. In addition, a 48-bit identifier introduces each block. This block structure allows a partial reconstruction of damaged files.
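The header layout described above can be inspected directly from PHP. The following is a minimal sketch, assuming the bz2 extension is available; the sample string and the array keys are illustrative only. It compresses a small string with block size 9 and picks apart the first ten bytes of the result.

```php
<?php
// Compress some sample data with block size 9 (900,000 bytes).
$compressed = bzcompress(str_repeat("hello bzip2\n", 1000), 9);

$header = [
    'signature'   => substr($compressed, 0, 2),           // 'BZ'
    'method'      => $compressed[2],                      // 'h' for Huffman coding
    'level'       => (int) $compressed[3],                // block size digit 1-9
    'block_magic' => bin2hex(substr($compressed, 4, 6)),  // 48-bit block identifier
];

print_r($header);
```

Note that nothing in these fields records the uncompressed length of a block, which matches the observation in the question.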

GZIP/BZIP

Gzip and bzip2 are functionally equivalent. One advantage of GZIP is that it can compress a stream, that is, a sequence where you can't look back. This makes it the standard compressor for HTTP streams. The DEFLATE Compressed Data Format Specification (RFC 1951) and the GZIP File Format Specification (RFC 1952) are published documents.

GZIP explained

Thank you for the answer. You might have noticed that in my question I provide a link to the BZIP file format, which I had studied prior to posing the question. Your answer helps to understand how `bzwrite` writes data piecemeal. It is less clear to me how `bzread` manages to read the specified number of *uncompressed* bytes. Given that the degree of compression will vary depending on the data in each block, it isn't as straightforward as thinking "*he wants X bytes of uncompressed data so let me just fetch the next X/uncompressed_size blocks*" — DroidOS, Dec 16 '15 at 13:57
There is no fixed formula that maps the requested number of uncompressed bytes to a number of compressed bytes. First the Huffman tree is decoded in memory, and then the compressed data is decompressed according to that tree. — Vineet1982, Dec 17 '15 at 12:15
If there is anything more you need to know, just let me know; otherwise please accept the answer — Vineet1982, Dec 17 '15 at 12:37
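To make the point from the comments concrete, here is a small sketch (again assuming the bz2 extension; the sample data, the read lengths and the variable names are chosen for illustration). It reads the same compressed file with different length arguments. The caller never has to work out how many compressed bytes to fetch: bzread decompresses as needed and hands back at most length bytes of the result per call.

```php
<?php
$original = str_repeat("0123456789", 5000); // 50,000 bytes of sample data
$path = tempnam(sys_get_temp_dir(), 'bz');
file_put_contents($path, bzcompress($original));

$roundTrips = [];   // did we get the original back for this read length?
$largestChunk = []; // biggest single chunk bzread returned for this length

foreach ([100, 1024, 8192] as $length) {
    $bz = bzopen($path, 'r');
    $restored = '';
    $largest = 0;
    while (!feof($bz)) {
        $chunk = bzread($bz, $length);
        if ($chunk === false) {
            break; // read error
        }
        $largest = max($largest, strlen($chunk));
        $restored .= $chunk;
    }
    bzclose($bz);
    $roundTrips[$length] = ($restored === $original);
    $largestChunk[$length] = $largest;
}
unlink($path);

var_dump($roundTrips, $largestChunk);
```

So the length argument only limits how much decompressed output is returned per call; it does not need to match the sizes used with bzwrite.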
