Let me explain: in my use case a system gives me many strings that vary widely in size (number of characters; length), and sometimes they are really huge. I have to save each string in a column of a table in a SQL Server database. The bad news is that I am not allowed to run any migration on this database; the good news is that the column already has type nvarchar(max).
I did some research beforehand and followed this post to write a data compressor using Gzip and Brotli:
https://khalidabuhakmeh.com/compress-strings-with-dotnet-and-csharp
using System.IO.Compression;
using System.Text;

var value = "hello world";
var level = CompressionLevel.SmallestSize;
var bytes = Encoding.Unicode.GetBytes(value);
await using var input = new MemoryStream(bytes);
await using var output = new MemoryStream();
// GZipStream shown here; the Brotli variant just swaps in BrotliStream.
// leaveOpen: true so disposing the compressor doesn't close the output stream.
await using (var stream = new GZipStream(output, level, leaveOpen: true))
{
    await input.CopyToAsync(stream);
} // dispose before reading: this flushes the final compressed block and footer
var result = output.ToArray();
var resultString = Convert.ToBase64String(result);
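The inverse direction looks roughly like this (a sketch of my decompress method, assuming the same Encoding.Unicode on both sides; the Brotli version is analogous):

// Base64 string -> bytes -> decompress -> original string
var compressedBytes = Convert.FromBase64String(resultString);
await using var compressedInput = new MemoryStream(compressedBytes);
await using var decompressed = new MemoryStream();
await using (var gzip = new GZipStream(compressedInput, CompressionMode.Decompress))
{
    await gzip.CopyToAsync(decompressed);
}
var roundTripped = Encoding.Unicode.GetString(decompressed.ToArray());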
After implementing the conversion methods, I wrote tests that generate random strings of varying sizes (lengths) to validate the compressors, and at this point I noticed the following. Both Gzip and Brotli first convert the string to a byte[] (byte array), then apply compression, which yields a byte array of reduced size, as expected; but the final step converts that result (byte[]) to a Base64 string which, in 100% of my tests, has more characters (a greater length) than the initial string.
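If I'm doing the arithmetic right, this outcome is inevitable for near-random data: an N-character string encoded with Encoding.Unicode becomes 2N bytes; if compression reduces those to C bytes, Base64 then produces 4 * ceil(C / 3), roughly 1.33 * C, characters. To end up below N characters, C must fall below 0.75 * N, i.e. the compressor has to shrink the 2N input bytes by more than 62.5%. A random character drawn from codes 33 ~ 127 carries only about 6.6 bits of entropy but is stored as 16 bits, so even an ideal compressor leaves roughly 0.82 * N bytes, which Base64 turns into about 1.09 * N characters, slightly longer than the original.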
My random string generator:
var random = new Random(); // one instance is enough for both the length and the characters
var wordLength = random.Next(randomWordParameters.WordMinLength, randomWordParameters.WordMaxLength);
var sb = new StringBuilder(wordLength);
for (int i = 0; i < wordLength; i++)
{
    // pick a random character code within the configured range
    var sourceNumber = random.Next(randomWordParameters.CharLowerBound, randomWordParameters.CharUpperBound);
    sb.Append(Convert.ToChar(sourceNumber));
}
var word = sb.ToString();
My sample strings aren't perfect representations of the real cases, but I believe they are good enough. The generator above produces completely random strings within a given length range; in my tests I passed character codes in the 33 ~ 127 range to Convert.ToChar(). The strings the system actually provides are in JSON format: in practice they are lists of URLs (tens of thousands of URLs), and the URLs usually contain random-looking character sequences, so I tried to generate strings that are as random as possible. The length check in my tests has roughly the shape sketched below.
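(A sketch only; GenerateRandomWord and CompressToBase64Async are placeholder names wrapping the two snippets above.)

var word = GenerateRandomWord(randomWordParameters);   // generator shown above
var base64 = await CompressToBase64Async(word);        // Gzip/Brotli + Base64 from the first snippet
Console.WriteLine($"original: {word.Length} chars, compressed + Base64: {base64.Length} chars");
// With random 33 ~ 127 characters, base64.Length came out larger than word.Length in every run.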
The point is this: consider the case where I try to save a string that, before compression, is already longer than the maximum length allowed in the column. What actually goes into the column is the Base64 string produced after compression, not the reduced-size byte array, and since that Base64 string has even more characters than the original, I believe the database will refuse it.
So here's my question: is there any (invertible) way to convert a string into a smaller one, where by "smaller" I mean "with fewer characters"? It seems that neither Gzip nor Brotli solves the problem.
PS: I made a point of highlighting the term "length" several times to make it clear that I'm talking about the number of characters, not the size in memory, because in several forums I read beforehand this confusion made it hard to reach conclusions.