
I have a requirement to archive all the data used to build a report every day. I compress most of the data using gzip, as some of the datasets can be very large (10mb+). I write each individual protobuf graph to a file. I also whitelist a fixed set of known small object types and added some code to detect whether the file is gzipped when I read it back (a sketch of that check follows the code below). This is because a small file, when compressed, can actually be bigger than uncompressed.

Unfortunately, just due to the nature of the data, I may only have a few elements of a larger object type, and the whitelist approach can be problematic.

Is there any way to write an object to a stream and compress it only if it reaches a threshold (like 8kb)? I don't know the size of the object beforehand, and sometimes I have an object graph with an IEnumerable<T> that might be considerable in size.

Edit: The code is fairly basic. I did skim over the fact that I store this in a filestream db table; that shouldn't really matter for implementation purposes. I removed some of the extraneous code.

public async Task SerializeModel<T>(TransactionalDbContext dbConn, T item, DateTime archiveDate, string name)
{
    var continuation = (await dbConn
        .QueryAsync<PathAndContext>(_getPathAndContext, new {archiveDate, model=name})
        .ConfigureAwait(false))
        .First();

    // Only types outside the known-small whitelist get gzipped.
    var useGzip = !_whitelist.Contains(typeof(T));

    using (var fs = new SqlFileStream(continuation.Path, continuation.Context, FileAccess.Write,
        FileOptions.SequentialScan | FileOptions.Asynchronous, 64*1024))
    using (var buffer = useGzip ? new GZipStream(fs, CompressionLevel.Optimal) : default(Stream))
    {
        _serializerModel.Serialize(buffer ?? fs, item);
    }

    dbConn.Commit();
}
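
For reference, the read-side check mentioned above can be as simple as looking at the GZip magic bytes. A minimal sketch (the helper name is mine, and it assumes the source stream is seekable):

// Minimal sketch of the read-side detection: a gzip member always starts
// with the magic bytes 0x1F 0x8B. Assumes the source stream is seekable.
private static Stream OpenPossiblyCompressed(Stream source)
{
    int b1 = source.ReadByte();
    int b2 = source.ReadByte();
    source.Seek(0, SeekOrigin.Begin);

    return (b1 == 0x1F && b2 == 0x8B)
        ? new GZipStream(source, CompressionMode.Decompress)
        : source;
}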
Michael B
  • Try to read 8k, if there is less data, output uncompressed. Otherwise, gzip the whole stream including the initial 8k? What's your specific problem? – Niklas B. Oct 26 '15 at 15:04
  • The problem is I only have a `T model` and I'm passing it to protobuf's `.SerializeWithLengthPrefix`; whether the object is 1 byte or 100mb, I have no idea. – Michael B Oct 26 '15 at 15:09
  • Well you plug in an output stream. That stream instance has to buffer 8k of data and then decide what to do. It can then pass on the decompressed or compressed data to another stream. – Niklas B. Oct 26 '15 at 15:34
  • @MichaelB Assume you can do that. How are you going to handle deserialization w/o knowing if the stream is zipped or not? – Ivan Stoev Oct 26 '15 at 17:13
  • That's the easy part. Gzip has a header that is easily detected. – Michael B Oct 26 '15 at 18:43

2 Answers


During serialization, you can use an intermediate stream to accomplish what you are asking for. Something like this will do the job:

class SerializationOutputStream : Stream
{
    Stream outputStream, writeStream;
    byte[] buffer;
    int bufferedCount;
    long position;
    public SerializationOutputStream(Stream outputStream, int compressThreshold = 8 * 1024)
    {
        writeStream = this.outputStream = outputStream;
        buffer = new byte[compressThreshold];
    }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return writeStream != null &&  writeStream.CanWrite; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { return position; } set { throw new NotSupportedException(); } }
    public override void Write(byte[] buffer, int offset, int count)
    {
        if (count <= 0) return;
        var newPosition = position + count;
        if (this.buffer == null)
        {
            // Decision already made (or flushed): write straight through.
            writeStream.Write(buffer, offset, count);
        }
        else
        {
            // Still below the threshold: accumulate into the internal buffer.
            int bufferCount = Math.Min(count, this.buffer.Length - bufferedCount);
            if (bufferCount > 0)
            {
                Array.Copy(buffer, offset, this.buffer, bufferedCount, bufferCount);
                bufferedCount += bufferCount;
            }
            int remainingCount = count - bufferCount;
            if (remainingCount > 0)
            {
                // Threshold exceeded: switch to GZip, emit the buffered bytes
                // compressed, then continue with the rest of this write.
                writeStream = new GZipStream(outputStream, CompressionLevel.Optimal);
                try
                {
                    writeStream.Write(this.buffer, 0, this.buffer.Length);
                    writeStream.Write(buffer, offset + bufferCount, remainingCount);
                }
                finally { this.buffer = null; }
            }
        }
        position = newPosition;
    }
    public override void Flush()
    {
        if (buffer == null)
            writeStream.Flush();
        else if (bufferedCount > 0)
        {
            // Threshold was never reached: emit the buffered bytes uncompressed.
            try { outputStream.Write(buffer, 0, bufferedCount); }
            finally { buffer = null; }
        }
    }
    protected override void Dispose(bool disposing)
    {
        try
        {
            if (!disposing || writeStream == null) return;
            try { Flush(); }
            finally { writeStream.Close(); }
        }
        finally
        {
            writeStream = outputStream = null;
            buffer = null;
            base.Dispose(disposing);
        }
    }
}

and use it like this

using (var stream = new SerializationOutputStream(new SqlFileStream(continuation.Path, continuation.Context, FileAccess.Write,
        FileOptions.SequentialScan | FileOptions.Asynchronous, 64*1024)))
    _serializerModel.Serialize(stream, item);
Ivan Stoev

datasets can be very large (10mb+)

On most devices, that is not very large. Is there a reason you can't read in the entire object before deciding whether to compress? Note also the suggestion from @Niklas to read in one buffer's worth of data (e.g. 8K) before deciding whether to compress.

This is because a small file, when compressed, can actually be bigger than uncompressed.

The thing that makes a small file potentially larger is the compression header and dictionary overhead. Some compression libraries allow you to use a custom preset dictionary that is known to both the compressing and decompressing sides. I used SharpZipLib for this many years back.
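
For illustration only (not part of the original answer), the compressing side with SharpZipLib might look roughly like the sketch below; the dictionary bytes and helper name are assumptions, and the decompressing side has to use the exact same dictionary via Inflater.SetDictionary.

// Rough sketch, assuming SharpZipLib's Deflater/DeflaterOutputStream API.
// The dictionary content is hypothetical; both sides must agree on it.
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.Zip.Compression;
using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

static void CompressWithDictionary(byte[] payload, Stream destination)
{
    // Hypothetical preset dictionary: byte patterns that recur in the data.
    byte[] presetDictionary = Encoding.UTF8.GetBytes("ReportDate CustomerId Amount");

    var deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // true = raw deflate, no zlib header
    deflater.SetDictionary(presetDictionary);

    using (var deflateOut = new DeflaterOutputStream(destination, deflater))
    {
        deflateOut.IsStreamOwner = false; // caller keeps ownership of destination
        deflateOut.Write(payload, 0, payload.Length);
    }
}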

It is more effort, in terms of coding and testing, to use this approach. If you feel the benefit is worthwhile, it may be the best option.

Note that no matter which path you take, you will physically store data in multiples of the block size of your storage device.

if the object is 1 byte or 100mb I have no idea

Note that Protocol Buffers are not really designed for large data sets:

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data.

If your largest object can comfortably serialize into memory, first serialize it into a MemoryStream, then either write that MemoryStream to your final destination, or run it through a GZipStream and then to your final destination. If the largest object cannot comfortably serialize into memory, I'm not sure what further advice to give.
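
A minimal sketch of that buffer-then-decide idea, reusing the names from the question's code (`fs`, `item`, `_serializerModel`) and an assumed 8 KB threshold:

// Sketch: serialize the whole object into memory first, then decide whether
// to gzip based on the serialized size. Assumes the object fits in memory.
const int CompressThreshold = 8 * 1024; // assumed threshold

using (var memory = new MemoryStream())
{
    _serializerModel.Serialize(memory, item);
    memory.Position = 0;

    if (memory.Length < CompressThreshold)
    {
        memory.CopyTo(fs); // small: store uncompressed
    }
    else
    {
        using (var gzip = new GZipStream(fs, CompressionLevel.Optimal, leaveOpen: true))
            memory.CopyTo(gzip); // large: compress on the way out
    }
}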

Eric J.
  • Unfortunately, I don't always have an easy way to read 8kb of data or anything like that. I just have a `T model` as the entry point. In general, I want to avoid reading the entire object, mostly because I'm not sure how to. 10mb was a random number; some are actually about 1gb, but anything above 10mb can slow the process down from GC-related pressure. I'm not sending the files anywhere, I'm just using them as an archive. For this purpose they work better than BinaryWriter. – Michael B Oct 26 '15 at 15:23
  • Please show the code that you are currently using to serialize the object. – Eric J. Oct 26 '15 at 15:24
  • Try reading e.g. 8K from `fs` into a MemoryStream. If you get 8K, you know you want to use a GZipStream. Rewind your stream if it supports rewinding, else dispose of it and create a new SqlFileStream and then use the rewound or new stream for deserialization. – Eric J. Oct 26 '15 at 15:57
  • Serialize will write everything though (be it 1 byte or 100mb). I think I can specify the attribute to support rewind. But I'd prefer not to have to write the entire file and then rewind and do it again. – Michael B Oct 26 '15 at 16:04
  • I'm not saying to write the entire file. In the first step, try to copy 8K from the SqlFileStream to a MemoryStream. Don't deserialize in that first step. – Eric J. Oct 26 '15 at 16:22
  • I guess I'm confused. How do I prevent protobuf from writing the entire object graph? – Michael B Oct 26 '15 at 16:43
  • Added a code example. Note you may be able to rewind the stream rather than disposing of the "probe" stream and instantiating a new one. – Eric J. Oct 26 '15 at 16:51
  • The filestream is empty before I write anything to it. The `.Serialize` call does the writing. `fs` is the target destination, not what I'm reading from. `T item` is the object that can be one byte or 100mb. – Michael B Oct 26 '15 at 16:58
  • Oh... sorry misread that part of the code. Thought you were copying the object from SQL. – Eric J. Oct 26 '15 at 17:35
  • If an individual object will comfortably fit in memory, you can entirely serialize it to a MemoryStream, and then use that MemoryStream as a source to write to the destination either with or without the GZipStream. If you cannot guarantee that your largest objects can comfortably be serialized into memory, I'm not sure what further advice to give. – Eric J. Oct 26 '15 at 17:39