One of our products at work involves files with the following structure:

A STRING WITH SOME CONTENT IDENTIFYING THE FILES CONTENTS
A STRING ON ROW 2
A STRING ON ROW 3
A STRING ON ROW 4
<binary data starts here and is gzipped>

Now if I do this, I can decompress the content and recreate an uncompressed version of the same file:

INPUT=FILEA.COMPRESSED
OUTPUT=FILEB.UNCOMPRESSED
# Copy the 4 header lines as-is, then decompress everything from line 5 onward
head -n 4 "$INPUT" > "$OUTPUT"
tail -n +5 "$INPUT" | gunzip >> "$OUTPUT"

# At this point I'm left with a file structure as follows:
A STRING WITH SOME CONTENT IDENTIFYING THE FILES CONTENTS
A STRING ON ROW 2
A STRING ON ROW 3
A STRING ON ROW 4
<uncompressed content>

I'm trying to accomplish this same feat with boost. Boost is always throwing a gzip_error code 4 which gzip.hpp reveals as bad_header.

The files I'm working with are no doubt not bulletproof; they are produced by a very old legacy system.

My main question: If gunzip can do it... is there a tweak or flag I'm overlooking with boost that can have it do it as well?

The C++ code that is failing looks like this (greatly simplified to focus on the point so it may contain syntax errors):

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <sstream>
#include <iostream>
#include <fstream>

using namespace std;
namespace io = boost::iostreams;

// Open File
ifstream file("myfile", ios::in|ios::binary);

int line = 1;
char c;
while (!file.eof() && line < 5){
   // I do do 'way' more error checking and proper handling here
   // in real code, but you get the point.. I'm moving the cursor
   // past the last new line and the beginning of what is otherwise
   // compressed content.
   file.get(c);
   if(c == '\n')line++;
}

stringstream ss;
// Store rest of binary data into stringstream
while(!file.eof()){
   file.get(c);
   ss.put(c);
}
// Close File
file.close();

// Return file pointer to potential gzip stream
ss.seekg(0, ios::beg);
try
{
   stringstream gzipped(ss.str());
   io::filtering_istream gunzip;
   gunzip.push(io::gzip_decompressor());
   gunzip.push(gzipped);
   io::copy(gunzip, ss);
}
catch(io::gzip_error const& ex)
{
   // always throws error code 4 here (bad_header)
   cout << "Exception: " << ex.error() << endl;
}

Here is some more information that may help:

  • OS: Redhat 5.7
  • Boost: boost-1.33.1-10 (el5 repository)
  • Platform: x86_64
  • GCC: version 4.1.2 20080704 (Red Hat 4.1.2-46)

My Makefile also has the following linker flags:

LDFLAGS = -lz -lboost_iostreams
Chris

1 Answer

I'm not sure if it's the root cause of your error, but your use of file.eof() is incorrect. The function returns true only after you've attempted to read past the end of file. It does NOT inform you if your next read will fail.

while(!file.eof()){ //1
   file.get(c);  // 2
   ss.put(c);    // 3
}

In this loop, suppose line 2 reads the last valid character and line 3 outputs it. The condition on line 1 is then tested again. Since you have not yet attempted to read past the end of the file, file.eof() returns false, so the loop body runs once more. The read on line 2 now fails, leaving c unchanged, and line 3 puts that stale character into ss a second time.

This results in an extra character at the end of the stream. I'm not certain if this is the sole problem, but it's probably one of them.
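
A minimal sketch of the usual fix, using only the standard library: make the read itself the loop condition, so the body runs only after a successful get(). The helper name read_remaining is made up for illustration and is not part of the original code.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: collect every remaining byte of `in` into a string.
// in.get(c) returns the stream, which converts to false once a read fails,
// so the body never runs with a stale value in c.
std::string read_remaining(std::istream& in) {
    std::string out;
    char c;
    while (in.get(c)) {
        out.push_back(c);
    }
    return out;
}
```

Written this way, a 3-byte input produces exactly 3 bytes, with no duplicated last character.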

Edit:

Okay, after looking at it, I'm not 100% sure WHY it's happening, but it's because you're reusing the stringstream ss. Either call ss.seekp(0, ios::beg) prior to doing the copy, or use a separate stringstream.

Personally, instead of copying ss to gzipped, I would write directly into gzipped from the input file, and then output via the copy into ss.
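
As a rough sketch of the first half of that idea, the payload can be pulled straight from the input into its own buffer, without going through ss. The name read_payload and the header_lines parameter are hypothetical, and this covers only the standard-library side; the returned bytes would then be what the decompressor reads from, with a separate stream receiving the output.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: skip `header_lines` newline-terminated lines, then
// return everything after them as raw bytes.
std::string read_payload(std::istream& in, int header_lines) {
    std::string line;
    for (int i = 0; i < header_lines && std::getline(in, line); ++i) {
        // Header text is discarded here; keep `line` if it is needed later.
    }
    std::ostringstream payload;
    payload << in.rdbuf();  // bulk-copy the rest; no eof() test involved
    return payload.str();
}
```

For the file layout in the question this would be called with header_lines set to 4, on a stream opened in binary mode.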

Dave S
  • Thanks for your input. With respect to your example above, I added `if(c == EOF)break;` to avoid storing the extra character, but I still get the same exception thrown. I should add that the first 2 characters of the compressed part of the file are 0x1f and 0x9d (I added cout lines to check). A hex editor shows that I'm starting at the right part of the file to begin uncompressing. I was also able to decompress the data by downloading an old copy of 'uncompress'. – Chris Jan 23 '13 at 21:48
  • Dave, thanks again for taking extra time to help me debug my issue. Your suggestion would cut a lot of overhead, but I needed to process it separately (in another stream) because there are cases where the received file is 'uncompressed'. I wanted to handle those cases by going back to the raw stream after the 'expected' exception from boost. As it turns out, the compression algorithm isn't gzip, but gunzip is so clever it handles many formats in addition to gzip (I didn't realize this). It's some older version of Lempel-Ziv coding (LZ78), so it's understandable to me now that boost is failing. – Chris Jan 29 '13 at 21:13
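
The two leading bytes mentioned in the comments are enough to tell these cases apart before handing anything to a decompressor: gzip streams start with 0x1f 0x8b, while the old compress (.Z) format starts with 0x1f 0x9d. A minimal sketch, with hypothetical names (PayloadFormat, sniff_format) that do not come from Boost or from the original code:

```cpp
#include <string>

// Hypothetical classification of the bytes that follow the text header.
enum PayloadFormat { FORMAT_GZIP, FORMAT_COMPRESS, FORMAT_UNKNOWN };

// Inspect the first two bytes (the "magic number") of the payload.
PayloadFormat sniff_format(const std::string& payload) {
    if (payload.size() < 2) {
        return FORMAT_UNKNOWN;
    }
    const unsigned char b0 = static_cast<unsigned char>(payload[0]);
    const unsigned char b1 = static_cast<unsigned char>(payload[1]);
    if (b0 == 0x1f && b1 == 0x8b) {
        return FORMAT_GZIP;      // gzip magic number
    }
    if (b0 == 0x1f && b1 == 0x9d) {
        return FORMAT_COMPRESS;  // compress (.Z) magic number
    }
    return FORMAT_UNKNOWN;       // possibly already uncompressed
}
```

Checking up front like this could replace the "expected exception" fallback: only FORMAT_GZIP payloads would go through the gzip decompressor, and anything else would be routed elsewhere or treated as raw data.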