
My string is the content of a JSON file (test.json), shown below:

{
  "objectId": "bbad4cc8-bce8-438e-8683-3e603d746dee",
  "timestamp": "2021-04-28T14:02:42.247Z",
  "variable": "temperatureArray",
  "model": "abc.abcdefg.abcdef",
  "quality": 5,
  "value": [ 43.471600438222104, 10.00940101687303, 39.925500606152, 32.34369812176735, 33.07786476010357 ]
}

I am compressing it as follows:

using ICSharpCode.SharpZipLib.GZip;
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;
using System.Text;

namespace GZipTest
{
    public static class SharpZipLibCompression
    {
        public static void Test()
        {
            Trace.WriteLine("****************SharpZipLib Test*****************************");
            var testFile = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "test.json");
            var text = File.ReadAllText(testFile);
            var ipStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(text);
            var compressedString = CompressString(text);
            var opStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(compressedString);
            float stringCompressionRatio = (float)opStringSize / ipStringSize;
            Trace.WriteLine("String Compression Ratio using SharpZipLib" + stringCompressionRatio);
        }

        public static string CompressString(string text)
        {
            if (string.IsNullOrEmpty(text))
                return null;
            byte[] buffer = Encoding.UTF8.GetBytes(text);
            using (var compressedStream = new MemoryStream())
            {
                GZip.Compress(new MemoryStream(buffer), compressedStream, false);
                byte[] compressedData = compressedStream.ToArray();
                return Convert.ToBase64String(compressedData);
            }
        }
    }
}

But my compressed string size (opStringSize) is more than the original string size (ipStringSize). Why?

  • Compression comes with overheads. For a small string, those overheads might well be larger than any gains you make from compression, particularly if your string doesn't compress well – canton7 May 25 '21 at 15:14
  • @canton7 Something looks fishy, though. When I use the built-in 'System.IO.Compression.GZipStream' I get a size lower than the original. It should never exceed the original size, right? Am I missing something? – KBNanda May 25 '21 at 15:21
  • No, that's exactly what canton was telling you. Zipping has an overhead. Say it adds 50 bytes to anything you zip. If you zip something that is originally 40 and compresses to 20, you get 20 + 50 = 70 > 40. If you zip something that is 60k and compresses to 40k, you get 40k + 50, which is insignificantly more than the compressed size and still a lot less than uncompressed. – Fildor May 25 '21 at 15:23
  • @Fildor You are right. I experimented with a bigger string and got a very good compression ratio, thanks :-). But for a small string like the one in the original question, compression makes things worse :-) ... still wondering why 'System.IO.Compression.GZipStream' compresses the same small string better? – KBNanda May 25 '21 at 15:27
  • Also, you're converting to base64. That massively increases the size of any compressed result. – canton7 May 25 '21 at 15:33

1 Answer

Your benchmark has some fairly fundamental problems:

  1. You're using UTF-16 to encode the input string to bytes when calculating its length (UTF8Encoding.Unicode is just an unclear way of writing Encoding.Unicode, which is UTF-16). That encodes to 2 bytes per character, but most of those bytes will be 0.
  2. You're base64-encoding your output. While this is a way to print arbitrary binary data as text, it uses 4 characters to represent 3 bytes of data, so you're increasing the size of your output by 33%.
  3. You're then using UTF-16 to turn the base64-encoded string into bytes again, which takes 2 bytes per character again. So that's an artificial 2x added to your result... (a corrected measurement is sketched just below this list).
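For illustration, here's a minimal sketch of a fairer measurement (hypothetical class and method names; it assumes the same SharpZipLib GZip.Compress call used in the question). It compares the UTF-8 input bytes against the raw compressed bytes, with no base64 step and no UTF-16 re-encoding:

using ICSharpCode.SharpZipLib.GZip;
using System;
using System.IO;
using System.Text;

public static class CompressionBenchmark
{
    public static void Run(string text)
    {
        // Input size: the UTF-8 bytes actually handed to the compressor.
        byte[] input = Encoding.UTF8.GetBytes(text);

        byte[] compressed;
        using (var output = new MemoryStream())
        {
            // Same call as in the question.
            GZip.Compress(new MemoryStream(input), output, false);
            compressed = output.ToArray();
        }

        // Output size: the compressed bytes themselves, no base64, no UTF-16.
        float ratio = (float)compressed.Length / input.Length;
        Console.WriteLine("Compression ratio: " + ratio);
    }
}

Measured this way on the JSON above, you get the ~0.80 ratio quoted below.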

It so happens that the two uses of UTF-16 more-or-less cancel out, but the base64 encoding is still responsible for a lot of the discrepancy you're seeing.

Take that out, and you get a compression ratio of 0.80338985.

That's not bad, given that compression introduces overheads: there's data which always needs to appear in a GZip stream, and it's there regardless of how well your data compresses. You can only really expect compression to make any significant difference on larger inputs.
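To make that fixed cost concrete, here's a minimal sketch (same assumptions as the sketch above) that compresses a zero-byte input. The output is still roughly 20 bytes, since the gzip container alone costs a 10-byte header, an 8-byte CRC/length trailer, and a couple of bytes for an empty deflate block:

using ICSharpCode.SharpZipLib.GZip;
using System;
using System.IO;

public static class GZipOverheadDemo
{
    public static void Run()
    {
        using (var output = new MemoryStream())
        {
            // Compress zero bytes of input.
            GZip.Compress(new MemoryStream(new byte[0]), output, false);

            // Roughly 20 bytes: 10-byte gzip header + 8-byte CRC/length
            // trailer + a couple of bytes for the empty deflate block.
            Console.WriteLine("Empty input -> " + output.ToArray().Length + " bytes");
        }
    }
}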

See here.

  • Yep. I'm seeing 0.79 with gzip. A key point however is that this input is too small to expect any significant compression. The OP should not be trying to compress individual short strings, but rather sequences of such strings, for much better compression. – Mark Adler May 25 '21 at 15:46
  • Yes. I got 0.76 with the changes done as suggested (my input string differed slightly from the original I shared here). I used a small string just as an example; in practice it's a JArray with multiple such elements. – KBNanda May 25 '21 at 16:01