How do I extract all the data from a bzip2 archive with C?

Question

I have a concatenated file made up of some number of bzip2 archives. I also know the sizes of the individual bzip2 chunks in that file.

I would like to decompress a bzip2 stream from an individual bzip2 data chunk, and write the output to standard output.

First I use fseek to move the file cursor to the desired archive byte, and then read the "size"-chunk of the file into a BZ2_bzRead call:

int headerSize = 1234;
int firstChunkSize = 123456;
FILE *fp = fopen("pathToConcatenatedFile", "r+b");
char *bzBuf = malloc(sizeof(char) * firstChunkSize);
int bzError, bzNBuf;
BZFILE *bzFp = BZ2_bzReadOpen(&bzError, *fp, 0, 0, NULL, 0);

# move cursor past header of known size, to the first bzip2 "chunk"
fseek(*fp, headerSize, SEEK_SET); 

while (bzError != BZ_STREAM_END) {
    # read the first chunk of known size, decompress it
    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, firstChunkSize);
    fprintf(stdout, bzBuf);
}

BZ2_bzReadClose(&bzError, bzFp);
free(bzBuf);
fclose(fp);

The problem is that when I compare the output of the fprintf statement with output from running bzip2 on the command line, I get two different answers.

Specifically, I get less output from this code than from running bzip2 on the command line.

More specifically, my output from this code is a smaller subset of the output from the command line process, and I am missing what is in the tail-end of the bzip2 chunk of interest.

I have verified through another technique that the command-line bzip2 is providing the correct answer, and, therefore, some problem with my C code is causing output at the end of the chunk to go missing. I just don't know what that problem is.

If you are familiar with bzip2 or libbzip2, can you provide any advice on what I am doing wrong in the code sample above? Thank you for your advice.

Are there any ASCII NUL bytes (zero bytes) in the file? Are there any percent characters in the data? You're using `fprintf()` very dangerously - you should probably use `fputs()` or even `fwrite()`, but failing that, use `fprintf(stdout, "%s", bzBuf);`. — Jonathan Leffler, Oct 12 '10 at 06:45
I get the same result with `fputs(bzBuf, stdout)` or `fprintf(stdout, "%s", bzBuf)`. To my knowledge the `bzip2`-ed chunks are alphanumeric, newline and tab characters. There are no percent symbols or null characters in the input that went into making the `bzip2` chunks, which can be uncompressed successfully with the `bzip2` command-line tool — Alex Reynolds, Oct 12 '10 at 06:51
OK - you're lucky on the percent symbols; the absence of zero bytes is not so unusual. Have you summed the number of bytes returned in bzNBuf, to see whether what the read operations return adds up to what you get in your output, or what bzip2 gets when it is dealing with the file? Your comment about 'read the first chunk of known size' is misleading - after it has read one chunk, anyway. Is the file you're decompressing bigger than firstChunkSize? — Jonathan Leffler, Oct 12 '10 at 07:00
Also, you should probably check bzError after the `BZ2_bzRead()` and before the `fprintf()`. However, that is more likely to end up with extra data than too little data. (Are you sure that your process is producing a smaller result than bzip2 is?) — Jonathan Leffler, Oct 12 '10 at 07:04
Regardless of where I `fseek` in my concatenated file, I get a smaller result. Is the `size` parameter in the API referring to the size of the `bzip2` chunk being read, or the size of the uncompressed output? From reading the API, it seems like the first option, but I could well be wrong. — Alex Reynolds, Oct 12 '10 at 07:11
Finally from me for tonight: since I see no information about BZ2_bzRead() null terminating its output, I think you should really worry about the lengths returned and whether you are miswriting the data. But the most likely ways to mismanage it end up with more output rather than less, so I'm not certain that's the trouble. — Jonathan Leffler, Oct 12 '10 at 07:12
Well, nearly finally - the [LIBBZ2](http://www.bzip.org/1.0.3/html/hl-interface.html) manual says that BZ2_bzRead() 'Reads up to len (uncompressed) bytes from the compressed file'. That makes sense; you tell it how much space is in your output buffer, and it does not go trampling beyond the end of that space with the uncompressed output. — Jonathan Leffler, Oct 12 '10 at 07:14
You're right, and I think another problem is that I am using `strtok` on `bzBuf`, tokenizing on newlines, when `bzBuf` probably doesn't have all the data I need in one shot. It's making sense why I am losing what's at the end — the `bzBuf` size is the size of the `bzip2` archive, not the uncompressed data, and so on the next pouring-into-`bzBuf`, data are not split across newlines. — Alex Reynolds, Oct 12 '10 at 07:50
Shouldn't your call to `BZ2_bzReadOpen()` use `fp` instead of `*fp`? — mwag, Sep 27 '19 at 04:36
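
The diagnosis reached in the comments above is that `BZ2_bzRead()` delivers up to `len` uncompressed bytes per call, so a read can end in the middle of a line, and tokenizing each buffer on newlines then mishandles the tail. Below is a minimal sketch of one way to carry the unfinished tail over to the next read. It is independent of libbzip2, and `complete_prefix` and `keep_tail` are hypothetical helper names introduced here for illustration, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libbzip2): returns how many leading
   bytes of buf[0..n) form complete lines, i.e. the index just past the
   last '\n'. Returns 0 if the buffer contains no newline at all. */
size_t
complete_prefix(const char *buf, size_t n) {
  while (n > 0 && buf[n - 1] != '\n')
    n--;
  return n;
}

/* Hypothetical helper: once the first `done` bytes have been processed,
   move the unfinished tail to the front of buf and return its length,
   so that the next read can append at buf + tail. */
size_t
keep_tail(char *buf, size_t n, size_t done) {
  size_t tail;

  assert(done <= n);
  tail = n - done;
  memmove(buf, buf + done, tail); /* regions may overlap, so not memcpy */
  return tail;
}
```

In a read loop this would presumably mean reading into `buf + tail` with a length of `sizeof buf - tail`, processing the first `complete_prefix(...)` bytes, and then calling `keep_tail`. A line longer than the whole buffer yields a prefix of 0, so a real program would have to grow the buffer or handle that case separately, and any tail left over at `BZ_STREAM_END` is a final line without a trailing newline.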

Roland Illig · Accepted Answer · 2010-10-12T07:51:12.963

This is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <bzlib.h>

int
bunzip_one(FILE *f) {
  int bzError;
  BZFILE *bzf;
  char buf[4096];

  bzf = BZ2_bzReadOpen(&bzError, f, 0, 0, NULL, 0);
  if (bzError != BZ_OK) {
    fprintf(stderr, "E: BZ2_bzReadOpen: %d\n", bzError);
    return -1;
  }

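  /* BZ2_bzRead can report BZ_STREAM_END together with the final bytes,
     so the buffer is written for both BZ_OK and BZ_STREAM_END. */
  while (bzError == BZ_OK) {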
    int nread = BZ2_bzRead(&bzError, bzf, buf, sizeof buf);
    if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
      size_t nwritten = fwrite(buf, 1, nread, stdout);
      if (nwritten != (size_t) nread) {
        fprintf(stderr, "E: short write\n");
        return -1;
      }
    }
  }

  if (bzError != BZ_STREAM_END) {
    fprintf(stderr, "E: bzip error after read: %d\n", bzError);
    return -1;
  }

  BZ2_bzReadClose(&bzError, bzf);
  return 0;
}

int
bunzip_many(const char *fname) {
  FILE *f;

  f = fopen(fname, "rb");
  if (f == NULL) {
    perror(fname);
    return -1;
  }

  fseek(f, 0, SEEK_SET);
  if (bunzip_one(f) == -1)
    return -1;

  fseek(f, 42, SEEK_SET); /* hello.bz2 is 42 bytes long in my case */
  if (bunzip_one(f) == -1)
    return -1;

  fclose(f);
  return 0;
}

int
main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: bunz <fname>\n");
    return EXIT_FAILURE;
  }
  return bunzip_many(argv[1]) != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

I cared very much about proper error checking. For example, I made sure that bzError was BZ_OK or BZ_STREAM_END before trying to access the buffer. The documentation clearly says that for other values of bzError the returned number is undefined.
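It shouldn't frighten you that about 50 percent of the code is concerned with error handling. That's how it should be. Expect errors everywhere.
The code still has some bugs. In case of errors it doesn't release the resources properly: bunzip_one returns without calling BZ2_bzReadClose on bzf, and bunzip_many returns without calling fclose on f. The return values of the fseek calls are not checked either.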

And these are the commands I used for testing:

$ echo hello > hello
$ echo world > world
$ bzip2 hello
$ bzip2 world
$ cat hello.bz2 world.bz2 > helloworld.bz2
$ gcc -W -Wall -Os -o bunz bunz.c -lbz2
$ ls -l *.bz2
-rw-r--r-- 1 roland None 42 Oct 12 09:26 hello.bz2
-rw-r--r-- 1 roland None 86 Oct 12 09:36 helloworld.bz2
-rw-r--r-- 1 roland None 44 Oct 12 09:26 world.bz2
$ ./bunz.exe helloworld.bz2 
hello
world
