
I'm compressing a log file as data is written to it, something like:

using (var fs = new FileStream("Test.gz", FileMode.Create, FileAccess.Write, FileShare.None))
{
  using (var compress = new GZipStream(fs, CompressionMode.Compress))
  {
    for (int i = 0; i < 1000000; i++)
    {
      // Clearly this isn't what is happening in production, just
      // a simple example
      byte[] message = RandomBytes();
      compress.Write(message, 0, message.Length);

      // Flush to disk (in production we will do this every x lines, 
      // or x milliseconds, whichever comes first)
      if (i % 20 == 0)
      {
        compress.Flush();
      }
    }
  }
}

What I want to ensure is that if the process crashes or is killed, the archive is still valid and readable. I had hoped that anything written up to the last flush would be safe, but instead I just end up with a corrupt archive.

Is there any way to ensure I end up with a readable archive after each flush?

Note: it isn't essential that we use GZipStream, if something else will give us the desired result.

Cocowalla
  • Why not write the files out in full fat mode, and compress in scheduled batches once they're on disk? It's going to be a whole bunch safer. – spender Mar 27 '13 at 11:26
  • Because we want to keep disk utilisation low. But we also want to maximise the amount of readable data if the process dies – Cocowalla Mar 27 '13 at 11:30
  • I'm also not sure if we could keep up if we were to compress in batches - files are rolled every minute, but each file can be up to around 500MB uncompressed – Cocowalla Mar 27 '13 at 11:33
  • Yeah, that's quite a ferocious weight of data. I see your angle... I'm just not sure it's possible. I suppose you could farm the log writing to another process and send logging to it via some sort of IPC (WCF perhaps). If you can't guarantee the stability of your main executable, then make a more stable logwriter. Still won't protect you against a power-cut though. – spender Mar 27 '13 at 11:47
  • The executable should be pretty stable, but I want to maximise available data if the unexpected happens – Cocowalla Mar 27 '13 at 11:59
  • After some experimentation, I've discovered that it will work if I close and reopen the FileStream and GZipStream every time I flush - but then the performance is *miserable* – Cocowalla Mar 27 '13 at 13:28
  • Another option is to enable compression at the file system level--let Windows handle the compression. You can enable compression for an entire disk, or for particular folders. – Jim Mischel Mar 27 '13 at 13:47
  • @JimMischel that's actually a refreshingly different way of approaching the problem, you should post as an answer! – Cocowalla Mar 27 '13 at 14:08
  • GZipStream.Flush does nothing. Source: http://msdn.microsoft.com/SV-SE/library/system.io.compression.gzipstream.flush.aspx – sisve Mar 27 '13 at 15:14
  • @SimonSvensson gah, you're right - it seems to handle flushing by itself, and you don't seem to have any control over when it flushes (unless you close the stream) :( – Cocowalla Mar 27 '13 at 15:40

3 Answers


An option is to let Windows handle the compression. Just enable compression on the folder where you're storing your log files. There are some performance considerations to be aware of when copying the compressed files, and I don't know how well NT compression performs in comparison to GZipStream or other compression options. You'll probably want to compare compression ratios and CPU load.

There's also the option of opening a compressed file, if you don't want to enable compression on the entire folder. I haven't tried this, but you might want to look into it: http://social.msdn.microsoft.com/forums/en-US/netfxbcl/thread/1b63b4a4-b197-4286-8f3f-af2498e3afe5
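
For the per-file route, a rough sketch of what that might look like is below: it asks NTFS to compress the file behind an already-open FileStream via FSCTL_SET_COMPRESSION. The constants come from the Windows SDK headers (winioctl.h); treat the P/Invoke as an assumption to verify rather than a drop-in implementation.

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class NtfsCompression
{
  // Values from the Windows SDK headers (winioctl.h)
  const uint FSCTL_SET_COMPRESSION = 0x9C040;
  const ushort COMPRESSION_FORMAT_DEFAULT = 1;

  [DllImport("kernel32.dll", SetLastError = true)]
  static extern bool DeviceIoControl(
    SafeFileHandle hDevice, uint dwIoControlCode,
    ref ushort lpInBuffer, uint nInBufferSize,
    IntPtr lpOutBuffer, uint nOutBufferSize,
    out uint lpBytesReturned, IntPtr lpOverlapped);

  // Ask NTFS to compress the file behind an already-open FileStream.
  // Note: this control code requires a handle with read+write access.
  public static void EnableCompression(FileStream fs)
  {
    ushort format = COMPRESSION_FORMAT_DEFAULT;
    uint bytesReturned;
    if (!DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_COMPRESSION,
        ref format, sizeof(ushort), IntPtr.Zero, 0,
        out bytesReturned, IntPtr.Zero))
    {
      throw new IOException("FSCTL_SET_COMPRESSION failed", Marshal.GetLastWin32Error());
    }
  }
}

You'd call EnableCompression right after creating the FileStream (opened with FileAccess.ReadWrite), then write plain uncompressed text to it; since the file system does the compressing, whatever has been flushed stays readable even if the process is killed.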

Jim Mischel
  • Really interesting way of looking at it, and thanks for the extra info (esp. about performance considerations) – Cocowalla Mar 27 '13 at 14:28
  • Accepting this answer as it gives me control over exactly when data is flushed to disk, and ensures files are readable if the process crashes (without having to run any repair tools) – Cocowalla Mar 29 '13 at 15:03

Good news: GZip is a streaming format. Therefore corruption at the end of the stream cannot affect the beginning, which was already written.

So even if your streaming writes are interrupted at an arbitrary point, most of the stream is still good. You can write yourself a little tool that reads from it and just stops at the first exception it sees.
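
A minimal sketch of such a tool, reusing the file names from the question: it just copies decompressed bytes out until the stream throws on the corrupt tail.

using System.IO;
using System.IO.Compression;

static class GzRecover
{
  static void Main()
  {
    using (var input = new GZipStream(File.OpenRead("Test.gz"), CompressionMode.Decompress))
    using (var output = File.Create("Test.recovered.txt"))
    {
      var buffer = new byte[64 * 1024];
      try
      {
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
          output.Write(buffer, 0, read);
        }
      }
      catch (InvalidDataException)
      {
        // Hit the corrupt/truncated tail - keep whatever was
        // successfully decompressed before this point
      }
    }
  }
}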

If you want an error-free solution I'd recommend splitting the log into one file every x seconds (maybe x = 1 or 10?). Write into a file with the extension ".gz.tmp" and rename it to ".gz" after the file has been completely written and closed.
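
A rough sketch of that rotation, with placeholder file names (same usings as the question: System.IO and System.IO.Compression):

// Placeholder names; in practice they would come from the roll interval/timestamp
string tmpPath = "log-001.gz.tmp";
string finalPath = "log-001.gz";

using (var fs = new FileStream(tmpPath, FileMode.Create, FileAccess.Write, FileShare.None))
using (var compress = new GZipStream(fs, CompressionMode.Compress))
{
  // ... write this interval's log entries ...
}

// The archive is only complete once the streams are closed,
// so only then does it get the real ".gz" name
File.Move(tmpPath, finalPath);

Readers can then simply ignore (or recover from) any leftover ".gz.tmp" file after a crash.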

usr
  • We already roll the log files once a minute. I'll need to look more into the 'error recovery' you suggest. – Cocowalla Mar 27 '13 at 14:10
  • Aha, I did some testing on corrupted archives (after the process was killed) using `gunzip < Test.gz > Test.recovered.txt`, and it was indeed able to recover most of the data – Cocowalla Mar 27 '13 at 14:35

Yes, but it's more involved than just flushing. Take a look at gzlog.h and gzlog.c in the zlib distribution. It does exactly what you want, efficiently adding short log entries to a gzip file, and always leaving a valid gzip file behind. It also has protection against crashes or shutdowns during the process, still leaving a valid gzip file behind and not losing any log entries.

I recommend not using GZIPStream. It is buggy and does not provide the necessary functionality. Use DotNetZip instead as your interface to zlib.
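
For illustration only, a sketch of how flushing might look through DotNetZip's zlib wrapper (Ionic.Zlib), mirroring the question's snippet. I'm assuming its GZipStream exposes a FlushMode property that can be set to FlushType.Sync; check the documentation for the version you're using before relying on this.

using System.IO;
using Ionic.Zlib;   // DotNetZip's zlib wrapper (assumed API, see above)

using (var fs = new FileStream("Test.gz", FileMode.Create, FileAccess.Write, FileShare.None))
using (var compress = new GZipStream(fs, CompressionMode.Compress))
{
  compress.FlushMode = FlushType.Sync;   // assumed: emit zlib sync flush points

  byte[] message = RandomBytes();        // helper from the question's example
  compress.Write(message, 0, message.Length);

  compress.Flush();   // push compressed-so-far data into the FileStream
  fs.Flush();         // and on to the OS
}

Even with sync flushes, the gzip trailer (CRC and length) is only written when the stream is closed, which is exactly the problem gzlog is designed to solve.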

Mark Adler
  • I just tried DotNetZip, and get exactly the same results as with GZIPStream. It used a bit less CPU, but that was the only difference. BTW, I'm on .NET 4.5, and I believe GZIPStream actually uses the zlib library under the hood now! – Cocowalla Mar 27 '13 at 15:39
  • Yes, .NET 4.5 at least fixed that. However the interface is more limited than DotNetZip's. – Mark Adler Mar 27 '13 at 15:58