7

I am trying to save a big UInt16 array into a file. positionCnt is about 50000 and stationCnt is about 2500. Saved directly, without GZipStream, the file is about 250MB, and an external zip program can compress it to 19MB. With the following code the file is 507MB. What am I doing wrong?

GZipStream cmp = new GZipStream(File.Open(cacheFileName, FileMode.Create), CompressionMode.Compress);
BinaryWriter fs = new BinaryWriter(cmp);
fs.Write((Int32)(positionCnt * stationCnt));
for (int p = 0; p < positionCnt; p++)
{
    for (int s = 0; s < stationCnt; s++)
    {
        fs.Write(BoundData[p, s]);
    }
}
fs.Close();
danatel

2 Answers

12

Not sure what version of .NET you're running on. In earlier versions, GZipStream used a window size that was the same size as the buffer you wrote from, so in your case it would try to compress each two-byte value individually. I think they changed that in .NET 4.0, but I haven't verified it.

In any case, what you want to do is make sure the GZipStream gets its data in large chunks. My original suggestion put the buffer on the wrong end, on the FileStream, which does nothing for the compression:

// Create file stream with 64 KB buffer (original suggestion; the buffer is on the wrong end)
FileStream fs = new FileStream(filename, FileMode.Create, FileAccess.Write, FileShare.None, 65536);
GZipStream cmp = new GZipStream(fs, CompressionMode.Compress);
...

The correction is to create a buffered stream ahead of the GZipStream, between the BinaryWriter and the compressor:

GZipStream cmp = new GZipStream(File.Open(cacheFileName, FileMode.Create), CompressionMode.Compress);
BufferedStream buffStrm = new BufferedStream(cmp, 65536);
BinaryWriter fs = new BinaryWriter(buffStrm);

This way, the GZipStream gets data in 64 Kbyte chunks, and can do a much better job of compressing.

Buffers larger than 64KB won't give you any better compression.
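
Putting the pieces together with the question's code, a minimal sketch of the corrected write path might look like this (BoundData, positionCnt, stationCnt, and cacheFileName are the names from the question):

// Sketch: the question's write loop with a 64 KB BufferedStream ahead of the GZipStream.
using (FileStream file = File.Open(cacheFileName, FileMode.Create))
using (GZipStream cmp = new GZipStream(file, CompressionMode.Compress))
using (BufferedStream buffStrm = new BufferedStream(cmp, 65536))
using (BinaryWriter writer = new BinaryWriter(buffStrm))
{
    writer.Write((Int32)(positionCnt * stationCnt));
    for (int p = 0; p < positionCnt; p++)
    {
        for (int s = 0; s < stationCnt; s++)
        {
            // Each UInt16 still goes through the BinaryWriter, but the BufferedStream
            // hands the GZipStream 64 KB at a time.
            writer.Write(BoundData[p, s]);
        }
    }
}

Disposing the BinaryWriter flushes the BufferedStream into the GZipStream before the file is closed, so nothing is lost at the end.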

Jim Mischel
  • .Net 4, uncompressed is 250MB, compressed 1-short at a time (regardless of buffer) is 411MB, compressed 2500-shorts at a time is 165MB. – user7116 Sep 28 '11 at 21:14
  • Thank you for the suggestion. But it does not help. The result with a larger buffer is about the same (517MB; I also changed the content of the array to speed up experiments). Also, there is a problem with the name fs you used in your example: fs is the BinaryWriter (this is my fault; the names fs and cmp I used are confusing). – danatel Sep 28 '11 at 21:29
  • @danatel: My mistake. I put the buffer on the wrong end. See my correction that uses `BufferedStream`. – Jim Mischel Sep 28 '11 at 21:48
  • Thank you, that helped - the result with my data and 65k buffer is 57MB; with initial content (the same UInt16 repeated over the buffer) 3MB, with my data and 20MB buffer the result is 50MB. – danatel Sep 29 '11 at 06:22
  • A 20MB buffer made a big improvement? They must have made some major changes for .NET 4.0. I'll have to give `GZipStream` another look. – Jim Mischel Sep 29 '11 at 14:40
  • Is this optimisation (using a BufferedStream) true for Decompression as well? – rollsch Feb 08 '18 at 02:54
  • @rolls yes, the buffered stream will improve decompression performance. – Jim Mischel Feb 08 '18 at 02:56
  • What if the byte array is already entirely in memory? That would not be a benefit then would it? – rollsch Feb 09 '18 at 22:33
  • @rolls The buffered stream improves the speed of writing to or reading from disk. If you're working strictly in memory, then the buffered stream would probably slow you down. – Jim Mischel Feb 11 '18 at 21:20
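
Following up on the decompression question in the comments: a minimal sketch of the read-back side with the same buffering, assuming the file was written as above (an Int32 count followed by that many UInt16 values, read here into a flat array for brevity):

// Sketch: decompress with a BufferedStream between the GZipStream and the BinaryReader.
using (FileStream file = File.Open(cacheFileName, FileMode.Open))
using (GZipStream dec = new GZipStream(file, CompressionMode.Decompress))
using (BufferedStream buffStrm = new BufferedStream(dec, 65536))
using (BinaryReader reader = new BinaryReader(buffStrm))
{
    int count = reader.ReadInt32();
    UInt16[] values = new UInt16[count];
    for (int i = 0; i < count; i++)
    {
        values[i] = reader.ReadUInt16();
    }
}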
3

For whatever reason, which is not apparent to me from a quick read of the GZip implementation in .NET, the compression you get is sensitive to the amount of data written at once. I benchmarked your code against a few styles of writing to the GZipStream and found that the most efficient version wrote long strides to the disk.

The trade-off is memory in this case, as you need to convert the short[,] to byte[] based on the stride length you'd like:

using (var writer = new GZipStream(File.Create("compressed.gz"),
                                   CompressionMode.Compress))
{
    // One row of the short[,] as raw bytes (2 bytes per short)
    var bytes = new byte[data.GetLength(1) * 2];
    for (int ii = 0; ii < data.GetLength(0); ++ii)
    {
        // Copy row ii into the byte buffer, then write the whole row in one call
        Buffer.BlockCopy(data, bytes.Length * ii, bytes, 0, bytes.Length);
        writer.Write(bytes, 0, bytes.Length);
    }

    // Random data written to every other 4 shorts
    // 250,000,000 uncompressed.dat
    // 165,516,035 compressed.gz (1 row strides)
    // 411,033,852 compressed2.gz (your version)
}
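
To make the memory trade-off above concrete, here is a sketch of a wider stride; rowsPerWrite and the output file name are illustrative and not part of the original benchmark:

// Sketch: write several rows per call; the temporary buffer grows with the stride.
int rowsPerWrite = 64;
using (var writer = new GZipStream(File.Create("compressed_strided.gz"),
                                   CompressionMode.Compress))
{
    int rowBytes = data.GetLength(1) * 2;          // bytes per row of shorts
    var bytes = new byte[rowBytes * rowsPerWrite];
    for (int ii = 0; ii < data.GetLength(0); ii += rowsPerWrite)
    {
        int rows = Math.Min(rowsPerWrite, data.GetLength(0) - ii);
        Buffer.BlockCopy(data, rowBytes * ii, bytes, 0, rowBytes * rows);
        writer.Write(bytes, 0, rowBytes * rows);
    }
}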
user7116
  • Thank you for your suggestion. I do not know what array content you used for your benchmark. My content is pretty regular and can be compressed down to 8MB; 165MB is too much. – danatel Sep 28 '11 at 21:35
  • `data[ii, jj] = random.Next()` for half the data (~125MB). I was merely pointing out the differences in compression using 1-short versus 1-row at a time. – user7116 Sep 28 '11 at 21:44
  • That explains the difference - random noise is not as compressible as my quite regular data. Thank you for your help. – danatel Sep 29 '11 at 06:26