
I am trying to compress an ASCII string (Base64) using GZip; however, it is producing more data instead of less. Can anyone help?

It is an old project and I'm limited in the compiler and Framework versions I can use. I have tried MSBuild 2.0, 3.5 & 4.0 - all produce the same erroneous results.

Imports System.IO.Compression
Imports System.Text

Private Function GZipString(ByVal asciiString As String) As Byte()

    Debug.Print ("asciiString length : {0}", asciiString.Length )
    Dim asciibytes As Byte() = Encoding.ASCII.GetBytes(asciiString)
    Debug.Print ("asciibytes length : {0}", asciibytes.Length )

    'GZip the string
    Dim ms As New MemoryStream()
    Dim gzips As New GZipStream(ms, CompressionMode.Compress)
    gzips.Write(asciibytes, 0, asciibytes.Length)
    gzips.Close()

    GZipString = ms.ToArray
    ms.Close()
    Debug.Print ("compressedBytes length : {0}", GZipString.Length )

End Function

The output I am getting is:

  • asciiString length : 3607
  • asciibytes length : 3607
  • compressedBytes length : 3985
  • What makes you think the results are erroneous? If the data contains none of the patterns that the compression algorithm knows how to compress, the result will not be smaller and may even be larger because of additional data added by the compression algorithm. – Blackwood Jun 29 '19 at 02:19
  • @Blackwood - The decompressor is complaining that it can't decompress the data. Debugging has led me back to the compressor. Base64 normally compresses very well. If I capture the Base64 string, store it in a file (MSDOS text format), zip it - it shrinks to 2810 bytes. – 42LeapsOfFaith Jun 29 '19 at 02:41
  • If the compressed data can't be decompressed, that is significant information about the problem and I think it would be a good idea for you to edit the question to include it. – Blackwood Jun 29 '19 at 02:49
  • Testing this code using a source string of `3018` bytes, the generated Base64 string is `4024` bytes. The compressed Base64 string outputs a byte array of `1418` bytes. Btw, you're not disposing of the streams. You have to. – Jimi Jun 29 '19 at 10:03
  • @Jimi - thanks for testing the code. I'll have to look deeper for what is broken. Do you know what MSBuild & Framework versions you used? – 42LeapsOfFaith Jun 29 '19 at 11:11
  • MSBuild `15.9.21.664` (not really relevant), .Net FW `4.7.2` (more relevant). Testing on .Net `3.5`, the resulting byte array length, using the same input, is `1804` instead of `1418`. Base64 conversion: `Dim base64TestBytes As Byte() = Encoding.UTF8.GetBytes(testString) Dim base64TestString As String = Convert.ToBase64String(base64TestBytes, Base64FormattingOptions.None)` – Jimi Jun 29 '19 at 11:25
  • Are you certain that you are having the issue when compiling against Framework version 4? The earlier versions were known to inflate already compressed data. See: [What's New in the .NET Framework 4 - Other New Features](https://learn.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ms171868(v=vs.100)#other-new-features). – TnTinMn Jun 29 '19 at 13:48
  • Thanks all for the comments. Using FW 4.0 I get - asciiString length : 3607, asciibytes length : 3607, compressedBytes length : 3635. This is an old project and won't run above FW 3.5. As a workaround, I'm going to write the string to a file, shell to compress.exe, read it back... Unless someone has a better idea? – 42LeapsOfFaith Jun 30 '19 at 00:18
  • I would worry about getting data that round-trips properly before worrying about size issues. -- Anyway, I know the exact order of the flush / close, etc. matters a lot to `GZipStream`; it's very finicky, so I posted code that is tested and working below. (Also, please stop using the `ASCII` encoder!) – BrainSlugs83 Sep 03 '19 at 20:07

1 Answer


1.) This is 2019: use UTF-8, not ASCII. UTF-8 is fully backward compatible with 7-bit ASCII and adds support for international characters (kanji, emoji, anything you can type). Most web browsers and servers have defaulted to UTF-8 for years now. You should generally avoid ASCII unless you are working with legacy software that requires it... (and even then, only for the part of the code that talks to that legacy system!) A short sketch illustrating the difference follows this list.

2.) For reference, compressing a string won't always result in fewer bytes, especially for short strings, because the gzip container itself adds roughly 18 bytes of header and trailer. A 3600-character string, though, should definitely compress (unless it contains completely random garbage; a human-typed plain-text string of that length should definitely shrink).

3.) You should be properly disposing of objects (via Using statements). Not doing so can lead to resource and/or memory leaks.

4.) Either your compression or your decompression code is wrong, and GZipStream can be super finicky, so I've included tested code that works for both C# and VB.NET below.
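As promised, here is a minimal VB.NET sketch of point 1 (the module and variable names are purely illustrative). The ASCII encoder silently replaces anything above code point 127 with "?", so non-ASCII text cannot round-trip, while UTF-8 preserves it and is byte-identical to ASCII for plain English characters:

Imports System.Text

Module EncodingDemo
    Sub Main()
        ' A string containing a non-ASCII character (the euro sign).
        Dim original As String = "Price: €100"

        ' ASCII replaces the euro sign with "?", silently losing data.
        Dim viaAscii As String = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(original))

        ' UTF-8 preserves every character.
        Dim viaUtf8 As String = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(original))

        Console.WriteLine(viaAscii)           ' Price: ?100
        Console.WriteLine(viaUtf8)            ' Price: €100
        Console.WriteLine(original = viaUtf8) ' True
    End Sub
End Module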

C# Version:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static void Main(string[] args)
{
    var input = string.Join(" ", args);
    var compressedBytes = CompressString(input);
    var dec = DecompressString(compressedBytes);

    Console.WriteLine("Input Length        = " + input.Length);                          // 537
    Console.WriteLine("Uncompressed Size   = " + Encoding.UTF8.GetBytes(input).Length);  // 539
    Console.WriteLine("Compressed Size     = " + compressedBytes.Length);                // 354 (smaller!)
    Console.WriteLine("Decompressed Length = " + dec.Length);                            // 537 (same size!)
    Console.WriteLine("Roundtrip Successful: " + (input == dec));                        // True
}

public static string DecompressString(byte[] bytes)
{
    using (var ms = new MemoryStream(bytes))
    using (var ds = new GZipStream(ms, CompressionMode.Decompress))
    using (var sr = new StreamReader(ds))
    {
        return sr.ReadToEnd();
    }
}

public static byte[] CompressString(string input)
{
    using (var ms = new MemoryStream())
    using (var cs = new GZipStream(ms, CompressionLevel.Optimal))
    {
        var bytes = Encoding.UTF8.GetBytes(input);
        cs.Write(bytes, 0, bytes.Length);

        // *REQUIRED* or last chunk will be omitted. Do NOT call any other close or
        // flush method.
        cs.Close();

        return ms.ToArray();
    }
}

VB.NET Version

(gross, I feel dirty):

Imports System.IO
Imports System.IO.Compression
Imports System.Text

Sub Main(args As String())
    Dim input As String = String.Join(" ", args)
    Dim compressedBytes As Byte() = CompressString(input)
    Dim dec As String = DecompressString(compressedBytes)

    Console.WriteLine("Input Length        = " & input.Length)                          ' 537
    Console.WriteLine("Uncompressed Size   = " & Encoding.UTF8.GetBytes(input).Length)  ' 539
    Console.WriteLine("Compressed Size     = " & compressedBytes.Length)                ' 354 (smaller!)
    Console.WriteLine("Decompressed Length = " & dec.Length)                            ' 537 (same size!)
    Console.WriteLine("Roundtrip Successful: " & (input = dec).ToString())              ' True
End Sub

Public Function DecompressString(ByVal bytes As Byte()) As String

    Using ms = New MemoryStream(bytes)
        Using ds = New GZipStream(ms, CompressionMode.Decompress)
            Using sr = New StreamReader(ds)

                Return sr.ReadToEnd()

            End Using
        End Using
    End Using

End Function

Public Function CompressString(input As String) As Byte()

    Using ms = New MemoryStream
        Using cs = New GZipStream(ms, CompressionLevel.Optimal)

            Dim bytes As Byte() = Encoding.UTF8.GetBytes(input)
            cs.Write(bytes, 0, bytes.Length)

            ' *REQUIRED* Or last chunk will be omitted. Do Not call any other close Or
            ' flush method.
            cs.Close()

            Return ms.ToArray()

        End Using
    End Using

End Function

Edit:

For .NET 3.5, this still works (and produces smaller output, though not as small as on 4.8; it only compresses down to 497 bytes instead of 354 bytes with my sample data).

You just need to change `CompressionLevel.Optimal` to `CompressionMode.Compress`, as sketched below.
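For example, here is what CompressString looks like with only that one change applied (the name CompressString35 is just for illustration); everything else, including the cs.Close() call, stays the same:

' .NET 3.5-compatible sketch: identical to CompressString above except the
' GZipStream constructor takes a CompressionMode instead of a CompressionLevel.
Public Function CompressString35(input As String) As Byte()

    Using ms = New MemoryStream
        Using cs = New GZipStream(ms, CompressionMode.Compress)

            Dim bytes As Byte() = Encoding.UTF8.GetBytes(input)
            cs.Write(bytes, 0, bytes.Length)

            ' Still *REQUIRED*, or the last chunk will be omitted.
            cs.Close()

            Return ms.ToArray()

        End Using
    End Using

End Function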
