
I am trying to serialize an object containing a list of very large composite object graphs (~200,000 nodes or more) using protobuf-net. Essentially, I want to save the complete object into a single file as quickly and compactly as possible.

My problem is that I get an OutOfMemoryException while trying to serialize the object. On my machine the exception is thrown when the file size is around 1.5 GB. I am running a 64-bit process and using a StreamWriter as input to protobuf-net. Since I am writing directly to a file, I suspect that some kind of buffering is taking place within protobuf-net, causing the exception. I have tried to use the DataFormat = DataFormat.Group attribute, but with no luck so far.
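
For context, here is my understanding of why Group matters here (a sketch of the protobuf wire format; the `Wrapper` type below is hypothetical, for illustration only):

```csharp
// Default ("length-delimited") encoding:  [tag][payload length][payload]
// -> the writer must know the payload length before emitting it, so the
//    whole subtree gets buffered in memory first.
// Group encoding:                         [start-group tag][payload][end-group tag]
// -> no up-front length, so the writer can in principle stream forwards-only.
[ProtoContract]
public class Wrapper   // hypothetical type, for illustration only
{
    [ProtoMember(1)]                                // default: length-prefixed, buffered
    public Node Buffered { get; set; }

    [ProtoMember(2, DataFormat = DataFormat.Group)] // group: forwards-only
    public Node Streamed { get; set; }
}
```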

I can avoid the exception by serializing each composite in the list to a separate file, but I would prefer to have it all done in one go if possible.

Am I doing something wrong, or is it simply not possible to achieve what I want?

Code to illustrate the problem:

class Program
{
    static void Main(string[] args)
    {
        int numberOfTrees = 250;
        int nodesPrTree = 200000;

        var trees = CreateTrees(numberOfTrees, nodesPrTree);
        var forest = new Forest(trees);

        using (var writer = new StreamWriter("model.bin"))
        {
            Serializer.Serialize(writer.BaseStream, forest);
        }

        Console.ReadLine();
    }

    private static Tree[] CreateTrees(int numberOfTrees, int nodesPrTree)
    {
        var trees = new Tree[numberOfTrees];
        for (int i = 0; i < numberOfTrees; i++)
        {
            var root = new Node();
            CreateTree(root, nodesPrTree, 0);
            var binTree = new Tree(root);
            trees[i] = binTree;
        }
        return trees;
    }

    private static void CreateTree(INode tree, int nodesPrTree, int currentNumberOfNodes)
    {
        Queue<INode> q = new Queue<INode>();
        q.Enqueue(tree);
        while (q.Count > 0 && currentNumberOfNodes < nodesPrTree)
        {
            var n = q.Dequeue();
            n.Left = new Node();
            q.Enqueue(n.Left);
            currentNumberOfNodes++;

            n.Right = new Node();
            q.Enqueue(n.Right);
            currentNumberOfNodes++;
        }
    }
}

[ProtoContract]
[ProtoInclude(1, typeof(Node), DataFormat = DataFormat.Group)]
public interface INode
{
    [ProtoMember(2, DataFormat = DataFormat.Group, AsReference = true)]
    INode Parent { get; set; }
    [ProtoMember(3, DataFormat = DataFormat.Group, AsReference = true)]
    INode Left { get; set; }
    [ProtoMember(4, DataFormat = DataFormat.Group, AsReference = true)]        
    INode Right { get; set; }
}

[ProtoContract]
public class Node : INode
{
    INode m_parent;
    INode m_left;
    INode m_right;

    public INode Left
    {
        get
        {
            return m_left;
        }
        set
        {
            m_left = value;
            m_left.Parent = null;
            m_left.Parent = this;
        }
    }

    public INode Right
    {
        get
        {
            return m_right;
        }
        set
        {
            m_right = value;
            m_right.Parent = null;
            m_right.Parent = this;
        }
    }

    public INode Parent
    {
        get
        {
            return m_parent;
        }
        set
        {
            m_parent = value;
        }
    }
}

[ProtoContract]
public class Tree
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public readonly INode Root;

    public Tree(INode root)
    {
        Root = root;
    }
}

[ProtoContract]
public class Forest
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public readonly Tree[] Trees;

    public Forest(Tree[] trees)
    {
        Trees = trees;
    }
}

Stack-trace when the exception is thrown:

at System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
at ProtoBuf.NetObjectCache.AddObjectKey(Object value, Boolean& existing) in NetObjectCache.cs:line 154
at ProtoBuf.BclHelpers.WriteNetObject(Object value, ProtoWriter dest, Int32 key, NetObjectOptions options) in BclHelpers.cs:line 500
at proto_5(Object , ProtoWriter )

I am trying a workaround where I serialize the array of trees one at a time to a single file using the SerializeWithLengthPrefix method. Serialization seems to work - I can see that the file size increases after each tree in the list is added to the file. However, when I try to deserialize the trees I get an invalid wire-type exception. I am creating a new file when I serialize the trees, so the file should be garbage-free - unless I am writing garbage, of course ;-). My serialization and deserialization methods are listed below:

using (var writer = new FileStream("model.bin", FileMode.Create))
{
    foreach (var tree in trees)
    {
        Serializer.SerializeWithLengthPrefix(writer, tree, PrefixStyle.Base128);
    }
}

using (var reader = new FileStream("model.bin", FileMode.Open))
{
    var trees = Serializer.DeserializeWithLengthPrefix<Tree[]>(reader, PrefixStyle.Base128);
}

Am I using the method in an incorrect way?
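
For reference, one likely fix (a sketch, not verified against my full model): since each tree was written individually, the file holds a sequence of Tree messages rather than a single Tree[], so they presumably need to be read back item by item - e.g. via protobuf-net's Serializer.DeserializeItems, pairing the same field number on both sides:

```csharp
using (var writer = new FileStream("model.bin", FileMode.Create))
{
    foreach (var tree in trees)
    {
        // write each tree with an explicit field number so the reader
        // knows what to expect
        Serializer.SerializeWithLengthPrefix(writer, tree, PrefixStyle.Base128, 1);
    }
}

using (var reader = new FileStream("model.bin", FileMode.Open))
{
    // DeserializeItems streams the individual Tree messages back lazily
    // (ToList needs System.Linq)
    var trees = Serializer.DeserializeItems<Tree>(reader, PrefixStyle.Base128, 1)
                          .ToList();
}
```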

mda
  • On mobile at the moment - will run it through a debugger later – Marc Gravell Apr 03 '13 at 17:46
  • I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". – John Saunders Apr 03 '13 at 17:47
  • @MarcGravell Thanks! looking forward to your findings – mda Apr 03 '13 at 17:58
  • side note; it *looks* like you're actually double-serializing at the moment, because your `Parent` / `Left` / `Right` are serialized **both** on the interface **and** the public `Node` API. I've removed the attributes from `Node` in my local - it'll still hit the dictionary limit, but it will probably use half the disk space when dictionary dies :) edit: yup - 840MB when dying now. – Marc Gravell Apr 03 '13 at 23:12
  • slightly content that `BinaryFormatter` *also* dies in this particular scenario :p But: if you can confirm for me the stack-trace is the `Dictionary`2.Resize` issue, then it is *probably solvable* without too much pain... what's a little sharding between friends, eh? – Marc Gravell Apr 03 '13 at 23:19
  • Thanks for pointing out the double serialization. It also dies at 840MB on my machine with this correction. I'm still very new to protobuf-net so if you see anything else that looks strange please point it out :-). I'll update with a stacktrace shortly. – mda Apr 04 '13 at 07:48
  • I have added the stack-trace and it shows - as you suspected - that it is the Dictionary'2.Resize issue. Is it solvable? – mda Apr 04 '13 at 08:29
  • @mda I honestly don't know. I did some work last night to try implementing sharding - but I think I have some kinks to figure out – Marc Gravell Apr 04 '13 at 19:01
  • @MarcGravell Sounds grim. Please let me know if there is anything I can do to help the process. I can get by using the serialize each composite to a seperate file, but it would of cause be preferable to have support for this scenario. At least if it is beneficial to other projects than this one. – mda Apr 04 '13 at 22:02
  • @MarcGravell As a side question. Is there a way to serialize the composite graphs separately but to a single file using protobuf-net? – mda Apr 04 '13 at 22:03
  • @mda if you have a sequence of separate pieces, then `SerializeWithLengthPrefix` (and the similar to deserialize) should allow for that. – Marc Gravell Apr 05 '13 at 06:24
  • @MarcGravell Thanks, I will give SerializeWithLengthPrefix a try later today. And a big thanks for taking the time to look into this. – mda Apr 05 '13 at 06:56
  • @MarkGravel, did either of you solve this out-of-memory-exception while serializing larger object structures? – oakman Sep 12 '17 at 18:31
  • @oakman, I ended up modifying my structure to make it more compact and efficient. So sadly, I never found a solution for the out-of-memory-exception. – mda Sep 21 '17 at 09:25

1 Answer


It wasn't helping that the AsReference code was only respecting the default data-format, which means it was trying to hold data in memory so that it could write the object-length prefix back into the data-stream - which is exactly what we don't want here (hence your quite correct use of DataFormat.Group). That accounts for the buffering of an individual branch of the tree. I've tweaked it locally, and I can definitely confirm that it is now writing forwards-only (the debug build has a convenient ForwardsOnly flag that I can enable, which detects this and shouts).

With that tweak, I have had it work for 250 x 20,000, but I'm getting secondary problems with the dictionary resizing (even in x64) when working on the 250 x 200,000 - like you say, at around the 1.5 GB level. It occurs to me, however, that I might be able to discard one of these dictionaries (forwards or reverse, respectively) during serialization and deserialization. I would be interested in the stack-trace when it breaks for you - if it is ultimately the dictionary resize, I may need to think about moving to a group of dictionaries...

Marc Gravell