
I'm currently developing an application in C# that uses Amazon SQS. The size limit for a message is 8 KB.

I have a method that is something like:

public void QueueMessage(string message)

Within this method, I'd like to first compress the message (most messages are passed in as JSON, so they are already fairly small).

If the compressed string is still larger than 8 KB, I'll store it in S3.

My question is:

How can I easily test the size of a string, and what's the best way to compress it? I'm not looking for massive reductions in size, just something nice and easy, and easy to decompress at the other end.

Alex

2 Answers


To know the "size" (in kb) of a string we need to know the encoding. If we assume UTF8, then it is (not including BOM etc) like below (but swap the encoding if it isn't UTF8):

int len = Encoding.UTF8.GetByteCount(longString);
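
As a minimal sketch of how that byte count could drive the size check from the question: the class name `MessageSize`, its members, and the reading of "8 KB" as 8 * 1024 bytes are all assumptions made for illustration, not anything defined by SQS or the original post.

```csharp
using System.Text;

public static class MessageSize
{
    // Assumption: the 8 KB limit from the question, taken as 8 * 1024 bytes.
    public const int MaxMessageBytes = 8 * 1024;

    // Number of bytes the string occupies once encoded as UTF-8 (no BOM).
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // True when the UTF-8 encoded string fits within the assumed limit.
    public static bool FitsInMessage(string text)
    {
        return Utf8ByteCount(text) <= MaxMessageBytes;
    }
}
```

Note that the byte count and `string.Length` differ as soon as the text contains non-ASCII characters: `"h\u00e9llo"` has a `Length` of 5 but takes 6 bytes in UTF-8, so the check has to be made on the encoded size.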

As for packing it, I would suggest GZIP over the UTF-8 bytes, optionally followed by base-64 if it has to be a string:

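    // requires: using System; using System.IO; using System.IO.Compression; using System.Text;
    using (MemoryStream ms = new MemoryStream())
    {
        // the third argument (leaveOpen: true) keeps ms usable after the GZipStream is closed
        using (GZipStream gzip = new GZipStream(ms, CompressionMode.Compress, true))
        {
            byte[] raw = Encoding.UTF8.GetBytes(longString);
            gzip.Write(raw, 0, raw.Length);
            gzip.Close(); // writes any buffered compressed data into ms
        }
        byte[] zipped = ms.ToArray(); // as a BLOB
        string base64 = Convert.ToBase64String(zipped); // as a string
        // store zipped or base64
    }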
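
Since the question also asks for something easy to decompress at the other end, here is a hedged sketch of the full round trip. The class `StringGzip` and its method names are hypothetical; it assumes .NET 4 or later for `Stream.CopyTo`, and that both sides agree on UTF-8 and base-64.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class StringGzip
{
    // Compress: string -> UTF-8 bytes -> gzip -> base-64 text.
    public static string CompressToBase64(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var ms = new MemoryStream())
        {
            // The GZipStream must be disposed before reading ms, so that its
            // internal buffer and the gzip trailer are written out.
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return Convert.ToBase64String(ms.ToArray());
        }
    }

    // Decompress: base-64 text -> gzip bytes -> UTF-8 bytes -> string.
    public static string DecompressFromBase64(string base64)
    {
        byte[] zipped = Convert.FromBase64String(base64);
        using (var input = new MemoryStream(zipped))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output); // reads until the end of the compressed data
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

Base-64 produces 4 characters for every 3 bytes, so it adds roughly a third to the compressed size. If the base-64 string is what gets queued, the size check should be made on that string rather than on the raw compressed bytes.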
Marc Gravell
  • Thanks. How do I determine the encoding? I haven't set this anywhere... I just serialize an object to JSON (using the Json.NET lib) – Alex May 04 '10 at 11:42
  • Question: is the `gzip.Close()` call necessary, considering exiting the `using` block should close it anyway? – tzaman May 04 '10 at 11:45
  • @alex: You'd choose the encoding yourself when serializing the string to binary. As Marc says, UTF-8 is the best choice for size, since most characters occupy only one byte in this encoding. – Will Vousden May 04 '10 at 11:47
  • @tzaman - to be honest, not sure; but I *do* know that `GZipStream` keeps a buffer even if you `Flush()`, so it must be closed. The `using` may indeed suffice, so maybe I'm being explicit unnecessarily. – Marc Gravell May 04 '10 at 11:52
  • @Will - well, *generally* it is; there are some i18n occasions where UTF8 will be more expensive. But it is a reasonable default. – Marc Gravell May 04 '10 at 11:53
  • @alex - an encoding is the map between character data and bytes; this *might* be listed in the SQS/S3 documentation? – Marc Gravell May 04 '10 at 11:54
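
To illustrate the point made in the comments that UTF-8 is usually, but not always, the smaller encoding, a small comparison sketch follows. The class `EncodingComparison` is a hypothetical helper; `Encoding.Unicode` is .NET's name for UTF-16 little-endian.

```csharp
using System.Text;

public static class EncodingComparison
{
    // Size of the text when encoded as UTF-8 (no BOM).
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Size of the text when encoded as UTF-16 little-endian (no BOM).
    public static int Utf16Bytes(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }
}
```

For ASCII-heavy text such as typical JSON, UTF-8 wins: `"hello"` is 5 bytes in UTF-8 and 10 in UTF-16. For text made mostly of CJK characters the result flips: `"\u65e5\u672c\u8a9e"` (three characters) is 9 bytes in UTF-8 and 6 in UTF-16.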

To unzip, pass the compressed bytes to this function. The best I could come up with was:

public static byte[] ZipToUnzipBytes(byte[] bytesContext)
        {
            // Nothing to decompress.
            if (bytesContext == null || bytesContext.Length == 0)
            {
                return null;
            }
            using (var inFile = new MemoryStream(bytesContext))
            using (var decompress = new GZipStream(inFile, CompressionMode.Decompress))
            using (var outFile = new MemoryStream())
            {
                // Stream.Read may return fewer bytes than requested, so copy until
                // the end of the stream rather than relying on a single Read call
                // into a buffer sized from the gzip trailer. This also avoids
                // returning an array padded with trailing zero bytes.
                decompress.CopyTo(outFile);
                return outFile.ToArray();
            }
        }