
Let me explain: in my use case a system gives me many strings that can vary in size (number of characters; length), and sometimes they can be really huge. The problem is that I have to save these strings in a column of a table in a "SQL Server" database. The bad news is that I am not allowed to do any migration in this database; the good news is that the column already has type nvarchar(max).

I did some research beforehand and followed the post below to write a data compressor using "Gzip" and "Brotli":

https://khalidabuhakmeh.com/compress-strings-with-dotnet-and-csharp

var value = "hello world";
var level = CompressionLevel.SmallestSize;

var bytes = Encoding.Unicode.GetBytes(value);
await using var input = new MemoryStream(bytes);
await using var output = new MemoryStream();

// GZipStream with BrotliStream
await using var stream = new GZipStream(output, level);

await input.CopyToAsync(stream);

var result = output.ToArray();
var resultString = Convert.ToBase64String(result);

After implementing the conversion methods, I created tests that generate random strings of varying sizes (lengths) to validate the performance of the compressors, and at this point I noticed the following. Both "Gzip" and "Brotli" first convert the string to a byte[] (byte array) and then apply compression, which yields a result array of reduced size, as expected. But the result (byte[]) is then converted to a base 64 string that, in 100% of my tests, has more characters (a greater length) than the initial string.
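Just to illustrate the arithmetic (a rough sketch; the 10 000 chars -> 8 263 bytes figures are taken from one of my UTF-8 + Gzip test runs): Base64 emits 4 characters for every 3 bytes, so the compressed payload would have to shrink by more than about 25% in bytes before the encoded text becomes shorter than the original string.

using System;

// Base64 break-even sketch (illustrative numbers from one of my test runs).
int originalChars = 10_000;                        // length of the source string
int compressedBytes = 8_263;                       // UTF-8 + Gzip output size
int base64Chars = 4 * ((compressedBytes + 2) / 3); // Base64 length: 4 chars per 3 bytes
Console.WriteLine(base64Chars);                    // 11020 -> longer than the 10 000-char original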

My random string generator:

// A single Random instance is enough here; two instances created back to back
// can end up with identical seeds on older .NET versions.
var random = new Random();
var wordLength = random.Next(randomWordParameters.WordMinLength, randomWordParameters.WordMaxLength);
var sb = new StringBuilder();
for (int i = 0; i < wordLength; i++)
{
    // Pick a random character code within the configured bounds (33..127 in my tests).
    int sourceNumber = random.Next(randomWordParameters.CharLowerBound, randomWordParameters.CharUpperBound);
    sb.Append(Convert.ToChar(sourceNumber));
}
var word = sb.ToString();

My sample strings are not perfect representations of the real cases, but I believe they are good enough. The generator method above produces completely random strings within a given length range, and in the tests I used characters from the values 33 ~ 127 passed to the Convert.ToChar() method. The strings provided by the system are in JSON format; in practice they are lists of URLs (with tens of thousands of URLs), and the URLs usually contain random sequences of characters, so I tried to generate strings that are as random as possible.

The fact is that, in the case where I try to save a string that is originally (before compression) larger than the maximum length allowed in the column, what actually goes to the column of the table in question is the base 64 string produced after compression, not the reduced-size byte array. And I believe the database will refuse that base 64 string, since its length (in number of characters) is greater than the length of the original string.

So here's my question: is there any (invertible) way to convert a string into a smaller one, and when I say smaller I mean "with reduced length" (fewer characters)? It seems that neither "Gzip" nor "Brotli" solves the problem.

PS: I made sure to highlight the term "length" several times to make it clear that I am talking about the number of characters and not the size in memory, because I noticed in several forums I read before that this confusion was making it difficult to reach conclusions.

1o0oP
  • Well trivially, you can observe that the code you linked is first converting to a unicode array using `Encoding.Unicode.GetBytes(value);`. If most of the characters are ASCII, then it's likely to be better to use `Encoding.UTF8.GetBytes()` (and the corresponding method when decoding of course). – Matthew Watson Apr 15 '22 at 14:01
  • After you mention this I tested it with "Encoding.UTF8.GetBytes(value)". It still produces a string larger than the original one in length (in number of characters). This is my test log (I used a string with 10k characters): SizeVariation = 10000 byte(s) -> 8263 byte(s); LengthVariation = 10000 char(s) -> 11020 char(s); DistanceInSize = -1737 byte(s); ReductionInSize = +17.37% (decreased in bytes); DistanceInLength = +1020 char(s); ReductionInLength = -10.2% (increased in length) – 1o0oP Apr 15 '22 at 16:18
  • As strings can contain any characters, it is only possible to compress strings if you restrict the available characters of all affected strings beforehand. E.g. you could compress "AAAAABB" to "A5B2", but this does not work on source strings which already contain numbers, and so on. Therefore the compression can result in a longer string than the source. Furthermore, you could store the reduced byte array (from Gzip/Brotli) as a hexadecimal string, but I guess it would also be longer than the source for the reasons mentioned. – tar Apr 15 '22 at 17:06
  • *"a system gives me many strings [...] I created tests that generate random strings"* -- Are the random strings that you create good representations of the actual strings that you get? Could you edit the question and show us one of each as an example? – Theodor Zoulias Apr 15 '22 at 17:24
  • If the data is truly random, other than the limit on the range of byte values, then the best lossless reduction you'll be able to get is about 18%. Hardly seems worth it. – Mark Adler Apr 15 '22 at 19:56
  • I edited the question and added the string generator method; it generates completely random strings in a given length range, using characters from the values 33 ~ 127 passed to Convert.ToChar(). The strings provided by the system are in JSON format, in practice lists of URLs (with tens of thousands of URLs), and the URLs usually have random sequences of characters, so I tried to generate strings as random as possible. – 1o0oP Apr 15 '22 at 20:05

2 Answers


The max size for a column of type NVARCHAR(MAX) is 2 GB of storage.

Since NVARCHAR uses 2 bytes per character, that's approx. 1 billion characters.

So I don't think you actually need compression. If the problem is performance when retrieving the data, you can use a server-side caching system instead.
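
For reference, a quick back-of-the-envelope check of that figure (a sketch based on the documented 2^31 - 1 byte cap for NVARCHAR(MAX)):

using System;

// NVARCHAR(MAX) stores up to 2^31 - 1 bytes; UTF-16 uses 2 bytes per character.
long maxStorageBytes = (1L << 31) - 1;    // 2,147,483,647 bytes
long maxCharacters = maxStorageBytes / 2; // 1,073,741,823 characters
Console.WriteLine(maxCharacters);         // roughly one billion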

Kvble
  • Good point. But if anyone else knows of any way to compress a string (in number of characters) I'd still love to know, so I'll keep the question open. – 1o0oP Apr 15 '22 at 16:08

The compression algorithms exploit repetitive patterns in the input stream. There is not much repetition in a typical URL, so compressing a single URL is unlikely to yield a representation that is much shorter than the original. In case the URL has no repetitive patterns at all (if it's close to a random string), the compression algorithm is going to yield a larger output than the input.

Below is a demonstration of this behavior, using Encoding.UTF8 to convert the URLs to bytes, and Encoding.Latin1 to convert the compressed bytes to a string:

using System.IO;
using System.IO.Compression;
using System.Text;

static string Compress(string value)
{
    byte[] bytes = Encoding.UTF8.GetBytes(value);
    using var input = new MemoryStream(bytes);
    using var output = new MemoryStream();
    using (var gz = new GZipStream(output, CompressionLevel.SmallestSize))
        input.CopyTo(gz); // gz is disposed at the end of this using statement, flushing the compressed data
    byte[] result = output.ToArray();
    // Latin1 maps each byte to exactly one character, so the string length equals the byte count.
    return Encoding.Latin1.GetString(result);
}

static string Decompress(string compressedValue)
{
    byte[] bytes = Encoding.Latin1.GetBytes(compressedValue);
    using var input = new MemoryStream(bytes);
    using var output = new MemoryStream();
    using (var gz = new GZipStream(input, CompressionMode.Decompress))
        gz.CopyTo(output);
    byte[] result = output.ToArray();
    return Encoding.UTF8.GetString(result);
}

I used three fairly long and non-repetitive URLs for the test:

string[] urls = new string[]
{
    "https://stackoverflow.com/questions/71884821/is-there-any-invertible-way-in-c-to-convert-a-string-into-a-smaller-one-an#comment127033258_71884821",
    "https://github.com/dotnet/runtime/blob/2d4f2d0c8f60d5f49e39f3ddbe1824648ee2b306/src/libraries/System.Private.CoreLib/src/System/Text/Encoding.cs#L77",
    "https://sharplab.io/#v2:CYLg1APgAgTAjAWAFBQMwAJabgdmQb2XWMwygBZ0BZAQwEsA7ACgEoiTCkTvsBOJgEQAJAKYAbMQHt0Ad0kAnMcAEsA3O2IBfZJqA===",
};
foreach (var original in urls)
{
    Console.WriteLine($"Original:     {original.Length} chars, {original.Substring(0, 50)}...");
    var compressed = Compress(original);
    double compression = (original.Length - compressed.Length) / (double)original.Length;
    Console.WriteLine($"Compressed:   {compressed.Length} chars, compression: {compression:0.00%}");
    var decompressed = Decompress(compressed);
    Console.WriteLine($"Decompressed: {decompressed.Length} chars");
    Console.WriteLine($"Successful:   {decompressed == original}");
    Console.WriteLine();
}

Output:

Original:     145 chars, https://stackoverflow.com/questions/71884821/is-th...
Compressed:   133 chars, compression: 8.28%
Decompressed: 145 chars
Successful:   True

Original:     148 chars, https://github.com/dotnet/runtime/blob/2d4f2d0c8f6...
Compressed:   143 chars, compression: 3.38%
Decompressed: 148 chars
Successful:   True

Original:     128 chars, https://sharplab.io/#v2:CYLg1APgAgTAjAWAFBQMwAJabg...
Compressed:   141 chars, compression: -10.16%
Decompressed: 128 chars
Successful:   True

Try it on Fiddle.

Two of the three URLs became slightly shorter after the compression, but the third URL was bloated instead.

You could store in the database either the compressed or the original value, depending on which is the shorter one. You could prefix the stored value with some marker, for example 'C' or 'U', so that you know if it's compressed or uncompressed.
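
For illustration, here is a minimal sketch of that marker idea, reusing the Compress and Decompress methods above; the 'C'/'U' prefix and the Encode/Decode names are just placeholders, not an established API:

static string Encode(string value)
{
    string compressed = Compress(value);
    // Store whichever representation is shorter, with a one-character marker in front.
    return compressed.Length < value.Length
        ? "C" + compressed   // compressed payload
        : "U" + value;       // stored uncompressed because compression didn't help
}

static string Decode(string stored) =>
    stored[0] == 'C' ? Decompress(stored.Substring(1)) : stored.Substring(1);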

Theodor Zoulias
  • Ok, getting a 3% gain on a list with tens of thousands of URLs is already a good gain! A pity that the percentages are low due to the random nature of the URLs; anyway it will help a lot. Thank you all very much for your support! – 1o0oP Apr 16 '22 at 19:56
  • @1o0oP actually I am not so enthusiastic about a gain of 3%, which could probably be achieved by just removing the initial "https://" from the URLs! Not to mention that by compressing the data in the database, you lose the ability to do direct SQL queries on that data with the tools provided by SQL Server. – Theodor Zoulias Apr 16 '22 at 20:05
  • 2
    The basic theory of compressing many small strings is to *first* take a *large sample* of expected strings. Use those samples to build a compression dictionary. Then "freeze" that dictionary - save it somewhere. After that, each individual string can be compressed using the same dictionary. I don't know if GZip has an option that could take advantage of such a pre-built dictionary? This technique might be useful even for "random" strings - if the "random" characters don't use all possible byte values. – ToolmakerSteve Apr 18 '22 at 02:24