
I'm compressing a log file as data is written to it, something like:

using (var fs = new FileStream("Test.gz", FileMode.Create, FileAccess.Write, FileShare.None))
{
  using (var compress = new GZipStream(fs, CompressionMode.Compress))
  {
    for (int i = 0; i < 1000000; i++)
    {
      // Clearly this isn't what is happening in production, just
      // a simple example
      byte[] message = RandomBytes();
      compress.Write(message, 0, message.Length);

      // Flush to disk (in production we will do this every x lines, 
      // or x milliseconds, whichever comes first)
      if (i % 20 == 0)
      {
        compress.Flush();
      }
    }
  }
}

What I want to ensure is that if the process crashes or is killed, the archive is still valid and readable. I had hoped that anything written up to the last flush would be safe, but instead I just end up with a corrupt archive.

Is there any way to ensure I end up with a readable archive after each flush?

Note: it isn't essential that we use GZipStream, if something else will give us the desired result.

Cocowalla
  • Why not write the files out in full fat mode, and compress in scheduled batches once they're on disk? It's going to be a whole bunch safer. – spender Mar 27 '13 at 11:26
  • Because we want to keep disk utilisation low. But we also want to maximise the amount of readable data if the process dies – Cocowalla Mar 27 '13 at 11:30
  • I'm also not sure if we could keep up if we were to compress in batches - files are rolled every minute, but each file can be up to around 500MB uncompressed – Cocowalla Mar 27 '13 at 11:33
  • Yeah, that's quite a ferocious weight of data. I see your angle... I'm just not sure it's possible. I suppose you could farm the log writing to another process and send logging to it via some sort of IPC (WCF perhaps). If you can't guarantee the stability of your main executable, then make a more stable logwriter. Still won't protect you against a power-cut though. – spender Mar 27 '13 at 11:47
  • The executable should be pretty stable, but I want to maximise available data if the unexpected happens – Cocowalla Mar 27 '13 at 11:59
  • After some experimentation, I've discovered that it will work if I close and reopen the FileStream and GZipStream every time I flush - but then the performance is *miserable* – Cocowalla Mar 27 '13 at 13:28
  • Another option is to enable compression at the file system level--let Windows handle the compression. You can enable compression for an entire disk, or for particular folders. – Jim Mischel Mar 27 '13 at 13:47
  • @JimMischel that's actually a refreshingly different way of approaching the problem, you should post as an answer! – Cocowalla Mar 27 '13 at 14:08
  • GZipStream.Flush does nothing. Source: http://msdn.microsoft.com/SV-SE/library/system.io.compression.gzipstream.flush.aspx – sisve Mar 27 '13 at 15:14
  • @SimonSvensson gah, you're right - it seems to handle flushing by itself, and you don't seem to have any control over when it flushes (unless you close the stream) :( – Cocowalla Mar 27 '13 at 15:40

3 Answers


An option is to let Windows handle the compression. Just enable compression on the folder where you're storing your log files. There are some performance considerations to be aware of when copying the compressed files, and I don't know how well NT compression performs in comparison to GZipStream or other compression options. You'll probably want to compare compression ratios and CPU load.

There's also the option of opening a compressed file, if you don't want to enable compression on the entire folder. I haven't tried this, but you might want to look into it: http://social.msdn.microsoft.com/forums/en-US/netfxbcl/thread/1b63b4a4-b197-4286-8f3f-af2498e3afe5
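
For the per-file route, a rough sketch of what that might look like is below: it asks NTFS to compress the file behind an already-open FileStream via FSCTL_SET_COMPRESSION. The constants come from the Windows SDK headers (winioctl.h); treat the P/Invoke as an assumption to verify rather than a drop-in implementation.

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class NtfsCompression
{
  // Values from the Windows SDK headers (winioctl.h)
  const uint FSCTL_SET_COMPRESSION = 0x9C040;
  const ushort COMPRESSION_FORMAT_DEFAULT = 1;

  [DllImport("kernel32.dll", SetLastError = true)]
  static extern bool DeviceIoControl(
    SafeFileHandle hDevice, uint dwIoControlCode,
    ref ushort lpInBuffer, uint nInBufferSize,
    IntPtr lpOutBuffer, uint nOutBufferSize,
    out uint lpBytesReturned, IntPtr lpOverlapped);

  // Ask NTFS to compress the file behind an already-open FileStream.
  // Note: this control code requires a handle with read+write access.
  public static void EnableCompression(FileStream fs)
  {
    ushort format = COMPRESSION_FORMAT_DEFAULT;
    uint bytesReturned;
    if (!DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_COMPRESSION,
        ref format, sizeof(ushort), IntPtr.Zero, 0,
        out bytesReturned, IntPtr.Zero))
    {
      throw new IOException("FSCTL_SET_COMPRESSION failed", Marshal.GetLastWin32Error());
    }
  }
}

You'd call EnableCompression right after creating the FileStream (opened with FileAccess.ReadWrite), then write plain uncompressed text to it; since the file system does the compressing, whatever has been flushed stays readable even if the process is killed.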

Jim Mischel
  • Really interesting way of looking at it, and thanks for the extra info (esp. about performance considerations) – Cocowalla Mar 27 '13 at 14:28
  • Accepting this answer as it gives me control over exactly when data is flushed to disk, and ensures files are readable if the process crashes (without having to run any repair tools) – Cocowalla Mar 29 '13 at 15:03

Good news: GZip is a streaming format. Therefore corruption at the end of the stream cannot affect the beginning, which was already written.

So even if your streaming writes are interrupted at an arbitrary point, most of the stream is still good. You can write yourself a little tool that reads from it and just stops at the first exception it sees.
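
A minimal sketch of such a tool, reusing the file names from the question: it just copies decompressed bytes out until the stream throws on the corrupt tail.

using System.IO;
using System.IO.Compression;

static class GzRecover
{
  static void Main()
  {
    using (var input = new GZipStream(File.OpenRead("Test.gz"), CompressionMode.Decompress))
    using (var output = File.Create("Test.recovered.txt"))
    {
      var buffer = new byte[64 * 1024];
      try
      {
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
          output.Write(buffer, 0, read);
        }
      }
      catch (InvalidDataException)
      {
        // Hit the corrupt/truncated tail - keep whatever was
        // successfully decompressed before this point
      }
    }
  }
}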

If you want an error-free solution I'd recommend splitting the log into one file every x seconds (maybe x = 1 or 10?). Write into a file with the extension ".gz.tmp" and rename it to ".gz" after the file has been completely written and closed.
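
A rough sketch of that rotation, with placeholder file names (same usings as the question: System.IO and System.IO.Compression):

// Placeholder names; in practice they would come from the roll interval/timestamp
string tmpPath = "log-001.gz.tmp";
string finalPath = "log-001.gz";

using (var fs = new FileStream(tmpPath, FileMode.Create, FileAccess.Write, FileShare.None))
using (var compress = new GZipStream(fs, CompressionMode.Compress))
{
  // ... write this interval's log entries ...
}

// The archive is only complete once the streams are closed,
// so only then does it get the real ".gz" name
File.Move(tmpPath, finalPath);

Readers can then simply ignore (or recover from) any leftover ".gz.tmp" file after a crash.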

usr
  • We already roll the log files once a minute. I'll need to look more into the 'error recovery' you suggest. – Cocowalla Mar 27 '13 at 14:10
  • Aha, I did some testing on corrupted archives (after the process was killed) using `gunzip < Test.gz > Test.recovered.txt`, and it was indeed able to recover most of the data – Cocowalla Mar 27 '13 at 14:35

Yes, but it's more involved than just flushing. Take a look at gzlog.h and gzlog.c in the zlib distribution. It does exactly what you want, efficiently adding short log entries to a gzip file, and always leaving a valid gzip file behind. It also has protection against crashes or shutdowns during the process, still leaving a valid gzip file behind and not losing any log entries.

I recommend not using GZIPStream. It is buggy and does not provide the necessary functionality. Use DotNetZip instead as your interface to zlib.
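
For illustration only, a sketch of how flushing might look through DotNetZip's zlib wrapper (Ionic.Zlib), mirroring the question's snippet. I'm assuming its GZipStream exposes a FlushMode property that can be set to FlushType.Sync; check the documentation for the version you're using before relying on this.

using System.IO;
using Ionic.Zlib;   // DotNetZip's zlib wrapper (assumed API, see above)

using (var fs = new FileStream("Test.gz", FileMode.Create, FileAccess.Write, FileShare.None))
using (var compress = new GZipStream(fs, CompressionMode.Compress))
{
  compress.FlushMode = FlushType.Sync;   // assumed: emit zlib sync flush points

  byte[] message = RandomBytes();        // helper from the question's example
  compress.Write(message, 0, message.Length);

  compress.Flush();   // push compressed-so-far data into the FileStream
  fs.Flush();         // and on to the OS
}

Even with sync flushes, the gzip trailer (CRC and length) is only written when the stream is closed, which is exactly the problem gzlog is designed to solve.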

Mark Adler
  • I just tried DotNetZip, and get exactly the same results as with GZIPStream. It used a bit less CPU, but that was the only difference. BTW, I'm on .NET 4.5, and I believe GZIPStream actually uses the zlib library under the hood now! – Cocowalla Mar 27 '13 at 15:39
  • Yes, .NET 4.5 at least fixed that. However the interface is more limited than DotNetZip's. – Mark Adler Mar 27 '13 at 15:58