
I want to use protobuf-net to serialize stock market data. I'm playing around with following message model:

1st message: Meta Data describing what data to expect and some other info.
2nd message: DataBegin
3rd message: DataItem
4th message: DataItem
...
nth message: EndData

Here's an example of a Data Item:

class Bar {
    DateTime DateTime { get; set; }
    float Open { get; set; }
    float High { get; set; }
    float Low { get; set; }
    float Close { get; set; }
    int Volume { get; set; }
}

Right now I'm using TypeModel.SerializeWithLengthPrefix(...) to serialize each message (the TypeModel is compiled), which works great, but it's about 10x slower than serializing each message manually using a BinaryWriter. What matters here is of course not the meta data but the serialization of each DataItem. I have a lot of data, and in some cases it's read from and written to a file, where performance is crucial.

What would be a good way of increasing the performance of the serialization and deserialization of each DataItem?

Should I use ProtoWriter directly here? If yes how would I do this (i'm a bit new to Protocol Buffers).
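
For reference, the length-prefixed loop I'm describing looks roughly like this (protobuf-net v2 method names; the exact overloads and the null-at-end-of-stream behaviour are from memory, so check them against your version):

    // Rough sketch of the current approach ("model" is the compiled
    // RuntimeTypeModel, Bar is the DataItem type).
    using (var stream = File.Create("bars.dat"))
    {
        foreach (var bar in bars)
        {
            // Base128 prefix + field number 1 gives each message a varint header
            model.SerializeWithLengthPrefix(stream, bar, typeof(Bar),
                                            PrefixStyle.Base128, 1);
        }
    }

    using (var stream = File.OpenRead("bars.dat"))
    {
        object obj;
        // assumed to return null at end of stream
        while ((obj = model.DeserializeWithLengthPrefix(
                   stream, null, typeof(Bar), PrefixStyle.Base128, 1)) != null)
        {
            var bar = (Bar)obj;
            // ... use bar ...
        }
    }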

lukebuehler
  • The *fastest* way to do that is to write the sequence as one single message using "group"s for the sub-messages. If I try a few things, how many DataItem values are we typically talking about? (just so I can make a realistic case). – Marc Gravell Nov 30 '11 at 18:25
  • Between 200 and 100,000 items. But let's say 100'000 messages for now. But then there could be up to 6000 data streams with 100'000 messages each. This is mostly for back testing so it's about how quickly I can write the messages to files and then load it. When using this in real-time the performance is not as crucial. – lukebuehler Nov 30 '11 at 19:10
  • It seems to me that I cannot completely write a stream manually to "mock" SerializeWithLengthPrefix. How would I write the 0A (key is 1) prefix at the start of the binary stream using the ProtoWriter? – lukebuehler Nov 30 '11 at 20:56
  • honestly, don't go ProtoWriter - that won't be the key difference (that is what `CompileInPlace` does already) – Marc Gravell Nov 30 '11 at 21:32

1 Answer


Yes: if your data is a very simple set of homogeneous records with no additional requirements (for example, it doesn't need to be forwards compatible or version elegantly, or be usable from clients that don't fully know all the data), doesn't need to be conveniently portable, and you don't mind implementing all the serialization manually, then you can do it more efficiently manually. In a quick test:

protobuf-net serialize: 55ms, 3581680 bytes
protobuf-net deserialize: 65ms, 100000 items
BinaryFormatter serialize: 443ms, 4200629 bytes
BinaryFormatter deserialize: 745ms, 100000 items
manual serialize: 26ms, 2800004 bytes
manual deserialize: 32ms, 100000 items

The extra space is presumably the field markers (which you don't need if you are packing the records manually and don't need to worry about different versions of the API in use at the same time).
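
The arithmetic roughly checks out: (3,581,680 - 2,800,004) / 100,000 ≈ 7.8 bytes per item, i.e. about one tag byte for each of the six fields plus a couple of bytes of per-record framing. Relatedly (and re the "0A" question in the comments above), a protobuf field header is just the field number and wire type packed into a varint - this sketch shows the wire-format rule itself, not any protobuf-net API:

    // Wire-format rule: a field header is (fieldNumber << 3) | wireType,
    // encoded as a varint. Field 1, wire type 2 (length-delimited) gives
    // (1 << 3) | 2 = 0x0A - the 0A byte at the start of each length-prefixed
    // message. (Single-byte case only; field numbers >= 16 need more bytes.)
    static byte FieldHeader(int fieldNumber, int wireType)
        => (byte)((fieldNumber << 3) | wireType);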

I certainly don't reproduce "10x"; I get 2x, which isn't bad considering the things that protobuf-net offers - and it's certainly a lot better than BinaryFormatter, which is more like 20x slower! Here are some of those features:

  • version tolerance
  • portability
  • schema usage
  • no manual code
  • inbuilt support for sub-objects and collections
  • support for omitting default values
  • support for common .NET scenarios (serialization callbacks; conditional serialization patterns, etc)
  • inheritance (protobuf-net only; not part of the standard protobuf spec)

It sounds like in your scenario manual serialization is the thing to do; that's fine - I'm not offended ;p The purpose of a serialization library is to address the more general problem in a way that doesn't need manual code writing.

My test rig:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using ProtoBuf;
using ProtoBuf.Meta;
using System.Runtime.Serialization.Formatters.Binary;

public static class Program
{
    static void Main()
    {

        var model = RuntimeTypeModel.Create();
        model.Add(typeof(BarWrapper), true);
        model.Add(typeof(Bar), true);
        model.CompileInPlace();

        var data = CreateBar(100000).ToList();
        RunTest(model, data);

    }

    private static void RunTest(RuntimeTypeModel model, List<Bar> data)
    {
        using(var ms = new MemoryStream())
        {
            var watch = Stopwatch.StartNew();
            model.Serialize(ms, new BarWrapper {Bars = data});
            watch.Stop();
            Console.WriteLine("protobuf-net serialize: {0}ms, {1} bytes", watch.ElapsedMilliseconds, ms.Length);

            ms.Position = 0;
            watch = Stopwatch.StartNew();
            var bars = ((BarWrapper) model.Deserialize(ms, null, typeof (BarWrapper))).Bars;
            watch.Stop();
            Console.WriteLine("protobuf-net deserialize: {0}ms, {1} items", watch.ElapsedMilliseconds, bars.Count);
        }
        using (var ms = new MemoryStream())
        {
            var bf = new BinaryFormatter();
            var watch = Stopwatch.StartNew();
            bf.Serialize(ms, new BarWrapper { Bars = data });
            watch.Stop();
            Console.WriteLine("BinaryFormatter serialize: {0}ms, {1} bytes", watch.ElapsedMilliseconds, ms.Length);

            ms.Position = 0;
            watch = Stopwatch.StartNew();
            var bars = ((BarWrapper)bf.Deserialize(ms)).Bars;
            watch.Stop();
            Console.WriteLine("BinaryFormatter deserialize: {0}ms, {1} items", watch.ElapsedMilliseconds, bars.Count);
        }
        byte[] raw;
        using (var ms = new MemoryStream())
        {
            var watch = Stopwatch.StartNew();
            WriteBars(ms, data);
            watch.Stop();
            raw = ms.ToArray();
            Console.WriteLine("manual serialize: {0}ms, {1} bytes", watch.ElapsedMilliseconds, raw.Length);
        }
        using(var ms = new MemoryStream(raw))
        {
            var watch = Stopwatch.StartNew();
            var bars = ReadBars(ms);
            watch.Stop();
            Console.WriteLine("manual deserialize: {0}ms, {1} items", watch.ElapsedMilliseconds, bars.Count);            
        }

    }
    static IList<Bar> ReadBars(Stream stream)
    {
        using(var reader = new BinaryReader(stream))
        {
            int count = reader.ReadInt32();
            var bars = new List<Bar>(count);
            while(count-- > 0)
            {
                var bar = new Bar();
                bar.DateTime = DateTime.FromBinary(reader.ReadInt64());
                bar.Open = reader.ReadSingle();
                bar.High = reader.ReadSingle();
                bar.Low = reader.ReadSingle();
                bar.Close = reader.ReadSingle();
                bar.Volume = reader.ReadInt32();
                bars.Add(bar);
            }
            return bars;
        }
    }
    static void WriteBars(Stream stream, IList<Bar> bars )
    {
        using(var writer = new BinaryWriter(stream))
        {
            writer.Write(bars.Count);
            foreach (var bar in bars)
            {
                writer.Write(bar.DateTime.ToBinary());
                writer.Write(bar.Open);
                writer.Write(bar.High);
                writer.Write(bar.Low);
                writer.Write(bar.Close);
                writer.Write(bar.Volume);
            }
        }

    }
    static IEnumerable<Bar> CreateBar(int count)
    {
        var rand = new Random(12345);
        while(count-- > 0)
        {
            var bar = new Bar();
            bar.DateTime = new DateTime(
                rand.Next(2008,2011), rand.Next(1,13), rand.Next(1, 29),
                rand.Next(0,24), rand.Next(0,60), rand.Next(0,60));
            bar.Open = (float) rand.NextDouble();
            bar.High = (float)rand.NextDouble();
            bar.Low = (float)rand.NextDouble();
            bar.Close = (float)rand.NextDouble();
            bar.Volume = rand.Next(-50000, 50000);
            yield return bar;
        }
    }

}
[ProtoContract]
[Serializable] // just for BinaryFormatter test
public class BarWrapper
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<Bar> Bars { get; set; } 
}
[ProtoContract]
[Serializable] // just for BinaryFormatter test
public class Bar
{
    [ProtoMember(1)]
    public DateTime DateTime { get; set; }

    [ProtoMember(2)]
    public float Open { get; set; }

    [ProtoMember(3)]
    public float High { get; set; }

    [ProtoMember(4)]
    public float Low { get; set; }

    [ProtoMember(5)]
    public float Close { get; set; }

    // use zigzag if it can be -ve/+ve, or default if non-negative only
    [ProtoMember(6, DataFormat = DataFormat.ZigZag)]
    public int Volume { get; set; }
}
Marc Gravell
  • Wow, thanks for that detailed answer. To explain a little more how I want to use protocol buffers. I want to use the "normal" way of se(de)rializing almost all messages by using typeModel.Serialize or whatever. Mostly because of all the benefits you just listed. BUT in some cases for certain datatypes (like the Bars) id like to switch to a manual serialization mode, but I want the data to still be binary compatible, e.g. serialize using TypeModel deserialize manually. The extra work for those few types where I'd do that is okey if I squeeze some speed out of it. – lukebuehler Nov 30 '11 at 21:40
  • there is very little point in deserializing manually... you will just be duplicating what Compile() already does... the only way to make it tighter is not use the protobuf wire format (as defined by google). You *could* use a few `byte[]` BLOBs in places? Other than that... or am I missing the point of what you are trying to do? – Marc Gravell Nov 30 '11 at 22:09
  • Yeah you're right, I'm seeing the point now as well. With some tweaks here and there I'm now at 3x slower than manual packed serialization. But I see that it wont get much closer than that (just having to simply write each header). Thanks for your help! I really love Protobuf-net! I'll have to see if x2-x3 is acceptable or not. But it's surely the fastest serializer I've looked into so far! – lukebuehler Nov 30 '11 at 22:18
  • @lukebuehler note that normally io/bandwidth is the limiting factor. Writing to memorystream may impact the numbers a little. Try it do disk too. Then add an SSD :p note the other tricks I applied there: an outermost wrapper object, and the "group"-encoding for the list items – Marc Gravell Nov 30 '11 at 22:38