I am serializing a large data set using protocol buffer serialization. When my data set contains 400,000 custom objects with a combined size of around 1 GB, serialization completes in 3-4 seconds. But when the data set contains 450,000 objects with a combined size of around 1.2 GB, the serialization call never returns and the CPU is constantly consumed.

I am using a .NET port of Protocol Buffers.

Marc Gravell
muddxr
  • Why would you need to serialize such a dataset in one shot? I can't think of any valid reason. – Johann Blais Jun 15 '11 at 12:30
  • Probably you are out of memory and protobuf cannot finish serialization - check your memory usage. I think you should split this one big protobuf object into many smaller ones. This will allow better memory management. – Zuljin Jun 15 '11 at 12:36
  • @muddxr I suspect this would need some kind of reproducible example; I'll ping Jon, but (purely from a protobuf crazy angle) I'd also love to have a look at any example you can post. Does it behave the same in the IDE? If so, you could hit "pause" and see where the stack-trace is - I expect that would be invaluable to Jon. If not, perhaps Sam's tool: http://samsaffron.com/archive/2009/11/11/Diagnosing+runaway+CPU+in+a+Net+production+application would help find what it is doing – Marc Gravell Jun 15 '11 at 13:03
  • Note that by ".NET port of Protocol Buffers", I'm *assuming* you mean Jon's version, and not protobuf-net. If I am mistaken, please let me know. – Marc Gravell Jun 15 '11 at 13:05
  • @Marc We are using protobuf-net. Anyway, on investigating this problem further, it was found that the problem is not specific to protocol buffers; .NET serialization behaved the same way. – muddxr Jun 20 '11 at 05:41
  • @Marc I took a memory dump and found the serialization thread constantly at:
    System.IO.MemoryStream.set_Capacity(Int32)
    System.IO.MemoryStream.EnsureCapacity(Int32)
    System.IO.MemoryStream.WriteByte(Byte)
    MemoryStream.set_Capacity creates a new buffer of double the specified size and writes the data to it. I think it causes a problem when the capacity exceeds the int range. – muddxr Jun 20 '11 at 05:42
  • @muddxr - ah, if this is protobuf-net then that is me (not Jon). (it also isn't a "port" as such). I'll add an answer with some thoughts... – Marc Gravell Jun 20 '11 at 06:12
  • Also - I have to say: 3-4 seconds for 1GB is pretty good going! – Marc Gravell Jun 21 '11 at 11:28

2 Answers


Looking at the new comments, this appears to be (as the OP notes) limited by MemoryStream capacity. A slight annoyance in the protobuf spec is that, because sub-message lengths are variable and must prefix the sub-message, it is often necessary to buffer portions of the output until the length is known. This is fine for most reasonable graphs, but if there is an exceptionally large graph (except for the "root object has millions of direct children" scenario, which doesn't suffer from this) it can end up doing quite a bit of in-memory buffering.
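
To put a number on the comment-thread finding: MemoryStream is backed by a single byte[] whose capacity is an Int32 and which grows by doubling, so around this payload size the next doubling collides with the array-size ceiling. A minimal sketch of the growth arithmetic (assuming the classic doubling strategy, starting from the 256-byte minimum):

```csharp
using System;

// Sketch: MemoryStream grows its single backing byte[] by doubling.
// For a ~1.2 GB payload (as in the question), the doubling sequence
// lands above Int32.MaxValue, which no single managed array can hold.
long capacity = 256;           // MemoryStream's minimum initial capacity
long written = 1_200_000_000;  // ~1.2 GB, as in the question
int doublings = 0;
while (capacity < written)
{
    capacity *= 2;
    doublings++;
}
Console.WriteLine($"{doublings} doublings, final capacity {capacity} bytes");
// 23 doublings; final capacity 2,147,483,648 - one byte past Int32.MaxValue.
```

So the ~1 GB case still fits in the 2^30 step of the sequence, while ~1.2 GB demands the 2^31 step that a MemoryStream cannot allocate.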

If you aren't tied to a particular layout (perhaps due to .proto interop with an existing client), then a simple fix is as follows: on child (sub-object) properties (including lists / arrays of sub-objects), tell it to use "group" serialization. This is not the default layout, but it says "instead of using a length-prefix, use a start/end pair of tokens". The downside of this is that if your deserialization code doesn't know about a particular object, it takes longer to skip the field, as it can't just say "seek forwards 231413 bytes" - it instead has to walk the tokens to know when the object is finished. In most cases this isn't an issue at all, since your deserialization code fully expects that data.
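
For reference, the difference on the wire (per the protobuf encoding spec: wire type 2 is length-delimited, wire types 3 and 4 are start-group/end-group):

```
length-prefix:  [tag: field<<3 | 2] [varint length] [payload]
group:          [tag: field<<3 | 3] [payload] [tag: field<<3 | 4]
```

Since the group form needs no length up front, the writer can stream the payload straight through without buffering it first.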

To do this:

[ProtoMember(1, DataFormat = DataFormat.Group)]
public SomeType SomeChild { get; set; }

// ...

[ProtoMember(4, DataFormat = DataFormat.Group)]
public List<SomeOtherType> SomeChildren { get { return someChildren; } }

The deserialization in protobuf-net is very forgiving by default (there is an optional strict mode), and it will happily deserialize groups in place of length-prefixes, and length-prefixes in place of groups (meaning: any data you have already stored somewhere should still work fine).

Marc Gravell

1.2 GB of memory is dangerously close to the managed memory limit for 32-bit .NET processes. My guess is the serialization triggers an OutOfMemoryException and all hell breaks loose.

You should try several smaller serializations rather than one gigantic one, or move to a 64-bit process.
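
One way to do "several smaller serializations" with protobuf-net is to write each object as its own length-prefixed message, so no single call has to buffer the whole graph. A hedged sketch, where Item, items, and Process are hypothetical names standing in for the OP's types:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

// Write 450k objects as many small length-prefixed messages instead of
// one giant graph; each Serialize call only buffers one object.
using (var file = File.Create("items.bin"))
{
    foreach (Item item in items)  // items: IEnumerable<Item>, hypothetical
    {
        Serializer.SerializeWithLengthPrefix(file, item, PrefixStyle.Base128, 1);
    }
}

// Read them back lazily, one object at a time:
using (var file = File.OpenRead("items.bin"))
{
    foreach (Item item in Serializer.DeserializeItems<Item>(file, PrefixStyle.Base128, 1))
    {
        Process(item);  // hypothetical consumer
    }
}
```

This also sidesteps the MemoryStream growth issue entirely if you serialize straight to a FileStream, since the stream never needs one contiguous in-memory buffer for the full 1.2 GB.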

Cheers, Florian

Florian Doyon