
I'm having trouble reading a gzipped BlueCoat log file. The first six lines of the file are a header, and these are read correctly, but none of the content after them is read.

I have tried unzipping the log manually and reading the uncompressed file with slightly modified code, and that works. I suspect this is an issue with ASCII versus UTF-8 versus UTF-16, but I cannot get to the bottom of it, especially since the encoding seems to change mid-file.

The code I have at the moment is:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

// currentFile and logWriter are defined elsewhere in the program.
// try-with-resources closes the reader and every stream it wraps, even on error.
try (BufferedReader thisBr = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(new FileInputStream(currentFile)), "UTF-8"))) {
    String logLine;
    // Copy each decoded line to the output until end of stream.
    while ((logLine = thisBr.readLine()) != null) {
        logWriter.write(logLine + "\n");
    }
    logWriter.flush();
} catch (IOException e) {
    System.out.println("Exception has been thrown: " + e);
}
Audrius Meškauskas
MikeB
  • What error message do you get? What does the content of the file look like? Can you make that available? Can you remove one layer after the other, i.e. read one char at a time from `decoder` and one byte at a time from `gzipStream` to see which level causes this? A [SSCCE](http://sscce.org/) would be great if you can get the file input small enough. – MvG Dec 13 '12 at 19:13
  • Firstly, I don't get any error messages - but the output file only contains the headers of each of the eighteen log files that I am using for testing. I've tried unzipping one of the log files, converting it to UTF-8 and back to ASCII, then re-zipping the file, and then my code works the way I would expect, which seems to confirm that it is actually my log files that are 'broken' but I don't understand how they could change charset in the middle, nor why every charset I specify handles the header section correctly, and fails to handle the actual log section. – MikeB Dec 13 '12 at 19:48
  • Are all 18 log files in the same compressed file? If so, then it cannot be pure GZIP as that format doesn't bundle multiple files. You can read stuff as `ISO-8859-1` to ensure that all bytes can be read. You might get garbage output for non-ascii characters, but no immediate errors. As you are using a buffered reader, errors might occur whenever the decoder reaches broken input, even if that is many lines away. But you should see an exception in that case. – MvG Dec 13 '12 at 19:53
  • #Software: SGOS 6.3.1.1 #Version: 1.0 #Start-Date: 2012-04-02 06:26:30 #Date: 2011-12-13 19:11:26 2012-04-02 06:26:30 1 10.187.16.184 90002555 - - OBSERVED "Email" http://du110w.dub110.mail.live.com/mail/InboxLight.aspx?n=1487632400 200 TCP_HIT GET image/gif http du110w.dub110.mail.live.com 80 /mail/clear.gif - gif "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" 192.168.17.76 589 2778 - "Hotmail" "none" 2012-04-02 06:26:30 1 10.187.16.184 90002555 - - OBSERVED "Email" http://du110w.dub110.mail.live.com/mail/InboxLight.aspx?n=1487632400 200 TCP_HIT GET image/gif http du110w.dub ... – MikeB Dec 13 '12 at 19:53
  • No, each log is in a separate gzip – MikeB Dec 13 '12 at 19:54
  • Last comment has lost its new-lines ... Each hash starts a new line, then the data lines start with a date-stamp. – MikeB Dec 13 '12 at 19:55
  • I'm not seeing any exceptions. I have tried explicitly using UTF-8, US-ASCII and ISO-8859-1, and all seem to behave the same. I have also tried UTF-16, and that basically gives me garbage output, all double-spaced. – MikeB Dec 13 '12 at 20:00
  • For the record, I've now had to give up on this. :¬{ – MikeB Oct 09 '13 at 09:20
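
The layer-by-layer check suggested in the comments (read raw bytes from the gzip stream first, then decode with ISO-8859-1 so that no byte can fail to decode) could be sketched roughly as below. This is a minimal sketch, not code from the thread: the class name `GzipLayerProbe` and the methods `countBytes` and `countLines` are hypothetical, and it assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical diagnostic helper; names are illustrative, not from the original post.
final class GzipLayerProbe {

    // Layer 1: count the decompressed bytes with no charset decoding involved.
    static long countBytes(InputStream compressed) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream gz = new GZIPInputStream(compressed)) {
            int n;
            while ((n = gz.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    // Layer 2: count lines after decoding as ISO-8859-1, which maps every
    // byte value to a character, so decoding itself cannot reject input.
    static long countLines(InputStream compressed) throws IOException {
        long lines = 0;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.ISO_8859_1))) {
            while (br.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java GzipLayerProbe <file.gz>");
            return;
        }
        // Open the file once per layer so each probe starts from the beginning.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("decompressed bytes: " + countBytes(in));
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("decoded lines:      " + countLines(in));
        }
    }
}
```

If the byte count is far smaller than what `gunzip -c file.gz | wc -c` reports for the same file, the content is being lost in the decompression layer rather than in the charset decoding. One possibility that would fit, though nothing in this thread confirms it, is a file made of several concatenated gzip members being read only up to the end of the first member.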

0 Answers