
I posted a related but still different question regarding Protobuf-Net before, so here goes:

I wonder whether someone (esp. Marc) could comment on which of the following would most likely be faster:

(a) I currently store serialized built-in data types in a binary file: a long (8 bytes) and 2 floats (2 x 4 bytes). Each such triplet later makes up one object in deserialized state, and the long represents DateTime ticks for lookup purposes. I use a binary search to find the start and end locations of a data request. A method then reads the data in one chunk (from start to end location), knowing that the chunk consists of many of the above-described triplets (1 long, 1 float, 1 float) and that each triplet is always 16 bytes long. The number of triplets retrieved is therefore always (endLocation - startLocation) / 16. I then iterate over the retrieved byte array, deserialize each built-in type using BitConverter, instantiate a new object from each triplet, and store the objects in a list for further processing.
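Roughly, my current read path looks like the sketch below (the Tick type and the Bid/Ask field names are illustrative, not my actual code):

```csharp
// Sketch of approach (a): read a contiguous chunk of fixed 16-byte records
// and turn each record into an object with BitConverter. Names are illustrative.
using System;
using System.Collections.Generic;
using System.IO;

public class Tick
{
    public long Ticks;  // DateTime ticks, used for the binary-search lookup
    public float Bid;
    public float Ask;
}

public static class FixedRecordReader
{
    private const int RecordSize = 16; // 8 (long) + 4 (float) + 4 (float)

    public static List<Tick> ReadRange(string path, long startOffset, long endOffset)
    {
        var buffer = new byte[endOffset - startOffset];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(startOffset, SeekOrigin.Begin);
            int read = 0;
            while (read < buffer.Length)
            {
                int n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;
                read += n;
            }
        }

        var result = new List<Tick>(buffer.Length / RecordSize);
        for (int i = 0; i < buffer.Length; i += RecordSize)
        {
            result.Add(new Tick
            {
                Ticks = BitConverter.ToInt64(buffer, i),
                Bid   = BitConverter.ToSingle(buffer, i + 8),
                Ask   = BitConverter.ToSingle(buffer, i + 12)
            });
        }
        return result;
    }
}
```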

(b) Would it be faster to do the following? Build a separate file (or a header) that serves as an index for lookups. Instead of storing individual binary versions of the built-in types, I would use Protobuf-Net to serialize a List of the above-described objects (each built from a triplet of long, float, float). Each List would always contain exactly one day's worth of data (remember, the long represents DateTime ticks). Each List would obviously vary in size, hence the idea of a separate file or header for index lookups, since each data read request only ever asks for a multiple of full days. To retrieve the serialized list for one day, I would simply look up the index, read the byte array, deserialize it using Protobuf-Net, and already have my List of objects. I am asking mainly because I do not fully understand how deserialization of collections works in protobuf-net.
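Something like the following is what I have in mind for (b), re-declaring the Tick type from (a) with protobuf attributes; the field numbers and the index handling are only a sketch of the idea, not existing code:

```csharp
// Sketch of approach (b): one protobuf-net serialized List<Tick> per day,
// located via (offset, length) entries kept in a separate index file.
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class Tick
{
    [ProtoMember(1)] public long Ticks;
    [ProtoMember(2)] public float Bid;
    [ProtoMember(3)] public float Ask;
}

public static class DayChunkStore
{
    // Append one day's list to the data stream; the caller records the stream
    // position before and after this call in the index file.
    public static void AppendDay(Stream data, List<Tick> day)
    {
        Serializer.Serialize(data, day);
    }

    // Read one day back using the offset/length looked up in the index.
    public static List<Tick> ReadDay(Stream data, long offset, int length)
    {
        data.Seek(offset, SeekOrigin.Begin);
        var buffer = new byte[length];
        int read = 0;
        while (read < length)
        {
            int n = data.Read(buffer, read, length - read);
            if (n == 0) break;
            read += n;
        }
        using (var ms = new MemoryStream(buffer))
            return Serializer.Deserialize<List<Tick>>(ms);
    }
}
```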

To give a better idea of the magnitude of the data: each binary file is about 3 GB in size and thus contains many millions of serialized objects, covering about 1000 days' worth of data. Each data request may ask for any number of days' worth of data.

What, in your opinion, is faster in raw processing time? I wanted to garner some input before potentially writing a lot of code to implement (b). I currently have (a) and can process about 1.5 million objects per second on my machine (process = from data request to returned List of deserialized objects).

Summary: I am asking whether binary data can be read and deserialized faster using approach (a) or approach (b).

Matt
  • Wall. Of. Text. TL;DR. Can you summarise your question at the end for those of us who want to quickly see what the problem/question is without wading through masses of text? – slugster Jun 19 '12 at 07:59
  • The summary is in the headline, I am afraid the question cannot be asked with less details because such details make all the difference in the answers given...but I added a quick summary nonetheless. – Matt Jun 19 '12 at 09:42

2 Answers


I currently store serialized built-in data types in a binary file: a long (8 bytes) and 2 floats (2 x 4 bytes).

What you have is (and no offence intended) some very simple data. If you're happy dealing with raw data (and it sounds like you are) then it sounds to me like the optimum way to treat this is: as you are. Offsets are a nice clean multiple of 16, etc.

Protocol buffers generally (not just protobuf-net, which is a single implementation of the protobuf specification) is intended for a more complex set of scenarios:

  • nested/structured data (think: xml i.e. complex records, rather than csv i.e. simple records)
  • optional fields (some data may not be present at all in the data)
  • extensible / version tolerant (unexpected or only semi-expected values may be present)
    • in particular, can add/deprecate fields without it breaking
  • cross-platform / schema-based
  • and where the end-user doesn't need to get involved in any serialization details

It is a bit of a different use case! As part of this, protocol buffers uses a small but necessary field-header notation (usually one byte per field), and you would need a mechanism to separate records, since they aren't fixed-size - which is typically another 2 bytes per record. And, ultimately, the protocol buffers handling of float is IEEE-754, so you would be storing the exact same 2 x 4 bytes, but with added padding. The handling of a long integer can be fixed or variable size within the protocol buffers specification.
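For example, a quick (purely illustrative) way to see that per-record overhead is to serialize a single record and compare its length against the 16 bytes of the raw layout, assuming a [ProtoContract] type along the lines sketched in the question:

```csharp
// Illustrative size check: serialize one record with protobuf-net and compare
// against the 16 bytes of the raw fixed-size layout. The exact length depends
// on the values and on how the long is encoded (varint vs. fixed).
using System;
using System.IO;
using ProtoBuf;

[ProtoContract]
class Record
{
    [ProtoMember(1)] public long Ticks;
    [ProtoMember(2)] public float Bid;
    [ProtoMember(3)] public float Ask;
}

static class SizeCheck
{
    static void Main()
    {
        var r = new Record { Ticks = DateTime.UtcNow.Ticks, Bid = 1.2345f, Ask = 1.2347f };
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, r);
            Console.WriteLine("protobuf-net: {0} bytes vs. 16 bytes raw", ms.Length);
        }
    }
}
```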

For what you are doing, and since you care about fastest raw processing time, simple seems best. I'd leave it "as is".

Marc Gravell
  • Thanks, and I appreciate your advice, but do you mind elaborating on how the deserialization of a List of objects works in Protobuf-Net? What I would really like to know is whether deserializing each object manually (as I currently do) is faster, or whether deserializing a List that potentially contains hundreds of thousands of objects is faster. Or will Protobuf-Net deserialize each item in the list individually? Sorry, I am not an expert and this question may sound trivial to you. – Matt Jun 19 '12 at 09:49
  • Lists (or `repeated` in the protocol buffers parlance) are typically encoded as a repeated sequence of either: [header][length][sub-message], or [header][sub-message][footer], but in either case the [sub-message] is encoded as an object. There are ways of skipping through a protobuf stream, but since each record is not fixed length, this is trickier than just jumping to "16 x index" – Marc Gravell Jun 19 '12 at 11:44
  • Thanks for the explanation. I went in the end with my current solution. I was hoping that a collection could be serialized without having to serialize each element (in your case header, length, sub-message, if I understand you correctly), but that does not seem to be the case (I guess I lack a full understanding of how serialization works in detail). If a list of items serialized faster than serializing each item individually, then I might have tried it out, but it seems that's not how it works. Please correct me if I failed to understand your explanation. – Matt Jun 20 '12 at 03:41

I think using a "chunk" per day together with an index is a good idea, since it lets you do random access as long as each record is a fixed 16 bytes. If you have an index keeping track of the offset of each day in the file, you can also use memory-mapped files to create a very fast view of the data for a specific day or range of days.
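A minimal sketch of that idea, assuming the fixed 16-byte record layout from the question, a Tick type like the one sketched there, and an index that maps each day to its byte offset (names are illustrative):

```csharp
// Illustrative: map only the byte range for the requested days and read the
// fixed 16-byte records straight out of the memory-mapped view.
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedDayReader
{
    private const int RecordSize = 16;

    // startOffset/endOffset would come from the per-day index.
    public static List<Tick> ReadDays(string path, long startOffset, long endOffset)
    {
        long size = endOffset - startOffset;
        var result = new List<Tick>((int)(size / RecordSize));
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(startOffset, size, MemoryMappedFileAccess.Read))
        {
            for (long pos = 0; pos + RecordSize <= size; pos += RecordSize)
            {
                result.Add(new Tick
                {
                    Ticks = view.ReadInt64(pos),
                    Bid   = view.ReadSingle(pos + 8),
                    Ask   = view.ReadSingle(pos + 12)
                });
            }
        }
        return result;
    }
}
```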

One of the benefits of protocol buffers is that they make fixed-size data variable-sized, since values are compressed (e.g. a long value of zero is written using one byte). That, however, may give you some issues with random access in huge volumes of data.
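As a rough illustration of that variable sizing (the Holder type is hypothetical, and exact byte counts depend on the protobuf-net version and its default-value handling):

```csharp
// Illustrative only: the same long field serializes to different lengths
// depending on its value, because protobuf encodes it as a varint by default.
using System;
using System.IO;
using ProtoBuf;

[ProtoContract]
class Holder
{
    [ProtoMember(1)] public long Value;
}

static class VarintDemo
{
    static void Main()
    {
        foreach (long v in new[] { 1L, 300L, DateTime.UtcNow.Ticks })
        {
            using (var ms = new MemoryStream())
            {
                Serializer.Serialize(ms, new Holder { Value = v });
                Console.WriteLine("{0,20} -> {1} bytes", v, ms.Length);
            }
        }
    }
}
```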

I'm not the protobuf expert (I have a feeling Marc will fill you in here), but my feeling is that protocol buffers are really best suited for small-to-medium-sized volumes of nontrivial, structured data that is accessed as a whole (or at least in whole records). For very large, randomly accessed streams of data I don't think there will be a performance gain, since you may lose the ability to do simple random access when different records are compressed by different amounts.

Anders Forsgren
  • I am not sure I fully comprehend your suggested answer. Even in (b), using Protobuf-Net, I would be able to do random retrievals, because each single day would be a protobuf-net-serialized list and the offsets of the binary data would be stored in an index, so the serialized lists could easily be accessed randomly through a binary search of the index. – Matt Jun 19 '12 at 09:51