
I store simple time series in the following format and look for the fastest way to read and parse them to "quote" objects:

DateTime, price1, price2, ... DateTime is in the following string format: YYYYmmdd HH:mm:ss:fff. price1 and price2 are strings of numbers with 5 decimal places (e.g. 1.40505).

I played with different ways to store and read the data and also toyed around with the protobuf-net library. A serialized file contained roughly 6 million rows (the raw csv was serialized in the following way):

TimeSeries object holding a List&lt;DataBlob&gt;; each DataBlob holding a Header object and a List&lt;Quote&gt; (one blob contains the quotes for one single day); each Quote holding a DateTime, double px1, and double px2.

It took about 47 seconds to read the serialized binary from disk and deserialize it, which seemed awfully long. In contrast, I kept the time series in csv string format, read each row into a List, and then parsed each row into DateTime dt, double px1, double px2, which I stuck into a newly created Quote object and added to a List. This took about 10 seconds to read (12 seconds with GZip compression, which makes the file about 1/9th of the size).
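For reference, a minimal sketch of the csv approach described above might look like this (the helper names and the exact format string are my guesses from the description, not the original code):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

class CsvQuote
{
    public DateTime DateTime;
    public double Px1;
    public double Px2;
}

static class CsvReaderSketch
{
    // Parse one "YYYYmmdd HH:mm:ss:fff,px1,px2" row into a quote.
    public static CsvQuote ParseRow(string row)
    {
        string[] parts = row.Split(',');
        return new CsvQuote
        {
            DateTime = DateTime.ParseExact(parts[0], "yyyyMMdd HH:mm:ss:fff",
                                           CultureInfo.InvariantCulture),
            Px1 = double.Parse(parts[1], CultureInfo.InvariantCulture),
            Px2 = double.Parse(parts[2], CultureInfo.InvariantCulture)
        };
    }

    // Read one day's file line by line into a list of quotes.
    public static List<CsvQuote> ParseDay(string fileName)
    {
        var quotes = new List<CsvQuote>();
        foreach (string line in File.ReadLines(fileName))
            quotes.Add(ParseRow(line));
        return quotes;
    }
}
```

Most of the time in a loop like this goes into DateTime.ParseExact and double.Parse, which is one reason a binary format that skips string parsing entirely can be much faster.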

At first sight it looks like I am either handling the protobuf-net functionality incorrectly or this particular kind of time series does not lend itself well to serialization/deserialization.

Any comments or help would be appreciated. Especially Marc, if you read this, could you possibly chime in and add some of your thoughts? I find it hard to imagine that I should end up with such different performance numbers.

Some information: I do not need random access to the data. I only need to read full days, thus storing one day's worth of data in an individual csv file made sense for my purpose, I thought.

Any ideas what the fastest way to read this kind of data may be? I apologize for the simplistic language; I am not a programmer at heart.

Here is a sample object I use for protobuf-net:

[ProtoContract]
class TimeSeries
{
    [ProtoMember(1)]
    public Header Header { get; set; }
    [ProtoMember(2)]
    public List<DataBlob> DataBlobs { get; set; }
}

[ProtoContract]
class DataBlob
{
    [ProtoMember(1)]
    public Header Header { get; set; }
    [ProtoMember(2)]
    public List<Quote> Quotes { get; set; }
}

[ProtoContract]
class Header
{
    [ProtoMember(1)]
    public string SymbolID { get; set; }
    [ProtoMember(2)]
    public DateTime StartDateTime { get; set; }
    [ProtoMember(3)]
    public DateTime EndDateTime { get; set; }
}

[ProtoContract]
class Quote
{
    [ProtoMember(1)]
    public DateTime DateTime { get; set; }
    [ProtoMember(2)]
    public double BidPrice { get; set; }
    [ProtoMember(3)]
    public long AskPrice { get; set; } //Expressed as Spread to BidPrice
}

Here is the code used to serialize/deserialize:

public static void SerializeAll(string fileNameWrite, List<Quote> QuoteList)
{
    //Header
    Header Header = new Header();
    Header.SymbolID = SymbolID;
    Header.StartDateTime = StartDateTime;
    Header.EndDateTime = EndDateTime;

    //Blob
    List<DataBlob> DataBlobs = new List<DataBlob>();
    DataBlob DataBlob = new DataBlob();
    DataBlob.Header = Header;
    DataBlob.Quotes = QuoteList;
    DataBlobs.Add(DataBlob);

    //Create TimeSeries
    TimeSeries TimeSeries = new TimeSeries();
    TimeSeries.Header = Header;
    TimeSeries.DataBlobs = DataBlobs;

    using (var file = File.Create(fileNameWrite))
    {
        Serializer.Serialize(file, TimeSeries);
    }
}

public static TimeSeries DeserializeAll(string fileNameBinRead)
{
    TimeSeries TimeSeries;

    using (var file = File.OpenRead(fileNameBinRead))
    {
        TimeSeries = Serializer.Deserialize<TimeSeries>(file);
    }

    return TimeSeries;
}
Matt
  • It would help if you posted the code snips so people can see if there is some logic error in your code. – Brian Jan 31 '12 at 15:28
  • Hi Matt, you may find you get a better response if you post a code example, or link to a solution that reproduces the issue on pastebin. Best regards, – Dr. Andrew Burnett-Thompson Jan 31 '12 at 15:29
  • thanks, added a sample object that I used for serialization/deserialization purposes in protobuf-net. I know AskPrice can be adjusted to a short... but I think for comparison's sake with other approaches it won't matter too much, as I like to end up with a "TimeSeries" object anyway – Matt Jan 31 '12 at 15:36
  • Any chance for more of the test you are running, so I can give an accurate answer? – Marc Gravell Jan 31 '12 at 16:03
  • @Marc, thanks for offering to take a look, I added the serializing and deserializing parts. – Matt Jan 31 '12 at 16:27
  • @Brian, I posted the snippets, any ideas? – Matt Jan 31 '12 at 16:37

1 Answer


The fastest way is a handcoded binary serializer, especially if you transform price ticks. That is what I do, although my volume is slightly different (600 million items per day, around 200,000 symbols with some being top heavy). I store nothing in a way that needs parsing from text. The parser is handcrafted and I use a profiler to optimize it - it also handles size very well (a trade is down to 1 byte sometimes).
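A minimal sketch of such a handcoded binary serializer (my own illustration under assumed field choices - DateTime ticks for time, integer pips for prices - not TomTom's actual code) could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// A quote stored as ticks and integer pips rather than strings or doubles.
struct BinQuote
{
    public long TimeTicks; // DateTime.Ticks
    public int BidPips;    // price * 100000, e.g. 1.40505 -> 140505
    public int AskPips;
}

static class BinarySerializerSketch
{
    // Write a count header followed by fixed-width records.
    public static void Write(Stream s, List<BinQuote> quotes)
    {
        using (var w = new BinaryWriter(s))
        {
            w.Write(quotes.Count);
            foreach (var q in quotes)
            {
                w.Write(q.TimeTicks);
                w.Write(q.BidPips);
                w.Write(q.AskPips);
            }
        }
    }

    // Read the records back without any string parsing.
    public static List<BinQuote> Read(Stream s)
    {
        using (var r = new BinaryReader(s))
        {
            int n = r.ReadInt32();
            var quotes = new List<BinQuote>(n);
            for (int i = 0; i < n; i++)
                quotes.Add(new BinQuote
                {
                    TimeTicks = r.ReadInt64(),
                    BidPips = r.ReadInt32(),
                    AskPips = r.ReadInt32()
                });
            return quotes;
        }
    }
}
```

In practice you would wrap the FileStream in a BufferedStream so the IO subsystem can read ahead while you decode.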

TomTom
  • Tom, I am aware the format can be much more optimized, such as converting datetime to ticks since yearxxxx, short askPrice(spread) = double askPrice - double bidPrice. But would you care to let us know how many quotes/trades you manage to read per second? – Matt Jan 31 '12 at 15:33
  • Well, I do not deal with doubles - I store price information in a struct that contains ticks & coding. I am not fully done with optimizing yet, but we talk of around 2 million per second per thread (sadly parallelizing this is hard due to delta information per instrument and the reader needing the data serially again - but I can decouple "reading" from "processing" via threading). – TomTom Jan 31 '12 at 15:36
  • General comment - I've done serialization of this sort of data as a custom binary serializer also - edit: and it was v. fast indeed - and not bothered with Protobuf. However, I'd be interested to see how pbuf is storing DateTimes. Hopefully it's being clever and storing the Int64 ticks, not the string representation!! :0 – Dr. Andrew Burnett-Thompson Jan 31 '12 at 15:38
  • agree, this is something I have not done yet (threading of reading and processing), I get to about 600k quotes a second with a single threaded approach. – Matt Jan 31 '12 at 15:39
  • I get around 1.5 million from my broker interface when back-executing tapes (nxCore - they store data in a replay file to disc), so 600,000 is low for me (they do a lot of stuff, so there is more overhead). I suggest using a profiler. Also make sure you buffer properly - BufferedStream - to allow the IO subsystem to load in the background. – TomTom Jan 31 '12 at 15:45
  • Tom, so I take it you store integer parts at the start of trading and only serialize the deltas? – Matt Jan 31 '12 at 15:45
  • @Tom, I am only talking about loading stored data rather than processing live feeds. And I only loaded a single file/symbol which should result in more throughput once I involve threading. I have a very fast, self-written merge algorithm, so that is not the issue, I am mostly having problems at the moment with getting the historical data from disk -> memory. – Matt Jan 31 '12 at 15:49
  • Yes, optionally. I do not store a delta when there is no change. The first byte contains the type of info and whether there is ANY additional info, the second contains flags for what info is there with what size. So a trade at the same time, same volume is just one byte (Trade, no info). This is quite often the case when you get e.g. CME reporting individual "parts". Another trade may be 3 bytes (Trade, size 22). Another 4 (Trade, time 2 time ticks, size 2). This means the trade happens 2 time ticks (defined at start) after the last trade. Time resolution is 25ms at the moment. – TomTom Jan 31 '12 at 15:49
  • Tom, got it, thanks for that. I work with fx tick data and thus always deal with quote changes (the data is pre-cleansed); it may be most optimal to store the handles at midnight (24 hour market) and just store the bid and ask deltas. The datetime can also most likely be greatly optimized to long ticks from a specific date. If anyone has better ideas please share. – Matt Jan 31 '12 at 15:57
  • Actually for storage it may make more sense to still store ticks as a delta in milli or seconds. The reader can keep track of when the last tick was (absolutely). This puts your bytes per trade for time down to 1.x - many 1 byte (255ms = 0.25 seconds) and sometimes 2 bytes. Less data = less UI bandwidth needed ;) And less storage. I process the full order book (10 bid/ask deep). – TomTom Jan 31 '12 at 16:10
  • Tom, would you mind giving me some hints how I best create the binaries and subsequently read/deserialize? My problem at the moment is that I want to store vast amounts of data in a single file and cannot read it all at once into memory, thus I need to access only partial data. I imagine it's impossible to access partial items in a List, for example, when such a list is serialized? Is there a smart way to go about that? – Matt Feb 01 '12 at 18:07
  • Well I can tell you what I do - I basically do not use any serialization except for generic rare stuff. I handcode all data write and read operations and make sure I only write very little data. Like an INT may get stored in one byte... if the value is small enough, or 0 bytes if the value is unchanged from the last occurrence. This requires a LOT of planning. It is worth it for financial data obviously. I am going to publish the code for that in a week or so as open source. – TomTom Feb 01 '12 at 18:19
  • Tom, awesome, would really look forward to take a peek at your code to get more detailed ideas. Have you ever had a chance to compare seek times between your approach and HDF5? Do you mind if I contacted you once you have made some of your code available as open source? – Matt Feb 01 '12 at 18:36
  • No comparison. That is not my goal. What I need is a fast playback file and a file that can be used over UDP (this can - it is packet oriented). I do not really care about seek times. The goal is to store the feeds in small enough chunks that it makes playback feasible. The most used scenario is displaying charts, or backtesting - both do not need seek in this scenario. – TomTom Feb 01 '12 at 19:13
  • just out of curiosity why do you plan to open up your code? Looking for a broader developer audience? or looking to monetize it through consulting or support? – Matt Feb 01 '12 at 19:26
  • Neither nor. Hoping some people find it useful. It is not really something that special. I won't open up everything, just the core transfer classes and data format - whether I ever open up more depends on whatever. I develop this for my own company's use. – TomTom Feb 01 '12 at 20:02
  • Sounds interesting. I am very keen on taking a look once it becomes available simply because I am admittedly still facing a steep learning curve when it comes to serialization and binary writing/reading issues (I am more of a quant trader rather than developer but I want to gain more knowledge because it feels satisfying to me to create something that rocks and it speeds things up rather than having to ask a developer each time I am stuck). – Matt Feb 02 '12 at 05:00
  • I mark this as the correct answer because I actually did end up writing my own serializer, which turned out to be the fastest. I get to process 2.6 million messages per second, far beyond what protobuf-net could accomplish in terms of raw speed. I guess it came down to raw speed vs. functionality, and as my access requirements are very basic I optimized for speed. Thanks especially to TomTom for your valuable comments. – Matt Feb 03 '12 at 13:50
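For readers landing here later, the delta scheme TomTom outlines in the comments (a flag byte, then only the fields that actually changed) can be sketched roughly like this. The exact flag layout is my own invention, and a real implementation would use variable-width deltas rather than full Int32s to reach the 1-byte-per-trade sizes mentioned above:

```csharp
using System;
using System.IO;

// Delta-encode successive (timeTick, pricePips) pairs: a flag byte says
// which optional fields follow, so an unchanged quote costs one byte.
static class DeltaCodecSketch
{
    const byte HasTimeDelta  = 1;
    const byte HasPriceDelta = 2;

    public static void WriteQuote(BinaryWriter w, ref int lastTime, ref int lastPips,
                                  int time, int pips)
    {
        byte flags = 0;
        if (time != lastTime) flags |= HasTimeDelta;
        if (pips != lastPips) flags |= HasPriceDelta;
        w.Write(flags);
        if ((flags & HasTimeDelta) != 0) w.Write(time - lastTime);
        if ((flags & HasPriceDelta) != 0) w.Write(pips - lastPips);
        lastTime = time;
        lastPips = pips;
    }

    // Reads one quote, updating lastTime/lastPips in place to the
    // reconstructed absolute values.
    public static void ReadQuote(BinaryReader r, ref int lastTime, ref int lastPips)
    {
        byte flags = r.ReadByte();
        if ((flags & HasTimeDelta) != 0) lastTime += r.ReadInt32();
        if ((flags & HasPriceDelta) != 0) lastPips += r.ReadInt32();
    }
}
```

The reader reconstructs absolute values by accumulating deltas, which is why the data has to be consumed serially per instrument - the property that makes parallelizing this hard, as discussed in the comments.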