
This question is sort of a follow on from SerializationException when serializing lots of objects in .NET.

Situation: I have a network of nodes that are all interconnected and may have somewhere between 10-30 variables and/or references per node. The network is about 9 million entries, but I have cropped out a section of 11,000 entries and cut off the references that point to the rest of the network.

I'm trying to write this section of the network to disk but I'm getting the following error:

System.Runtime.Serialization.SerializationException
"The internal array cannot expand to greater than Int32.MaxValue elements."

NOTE: As pointed out by stuartd in the comments, the practical limit on the number of items that can be serialized is around 6 million.

The most likely reason for this is that there is still a connection to the rest of the network that I'm not aware of. I have searched the code in great detail for any place where such a connection might remain, but without any luck. (I am going to keep looking, so this may still be the cause, but I also wanted to explore other avenues.)
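One way to test the "lingering connection" theory is to count how many nodes are actually reachable from the cropped section. The sketch below assumes a hypothetical `Node` class with a `Neighbors` list standing in for the real node type; `GraphProbe` and `CountReachable` are illustrative names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real node type; only the references matter here.
public class Node
{
    public List<Node> Neighbors = new List<Node>();
}

public static class GraphProbe
{
    // Counts every node reachable from the given roots by following references.
    // Each node is visited once, so cycles do not cause an endless walk.
    public static int CountReachable(IEnumerable<Node> roots)
    {
        // Node does not override Equals/GetHashCode, so the set compares by reference.
        var seen = new HashSet<Node>();
        var pending = new Stack<Node>();

        foreach (var root in roots)
            if (root != null && seen.Add(root)) pending.Push(root);

        while (pending.Count > 0)
        {
            var current = pending.Pop();
            foreach (var next in current.Neighbors)
                if (next != null && seen.Add(next)) pending.Push(next);
        }
        return seen.Count;
    }
}
```

If the count returned for the 11,000 cropped nodes is far larger than 11,000, some reference still leads back into the rest of the network.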

Question: What pitfalls or easy-to-make mistakes with BinaryFormatter might I be running into? And what can I do to overcome this size limit?

Edit: Added the serialization code. "this" is my network object that contains the 11,000 nodes.

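    // Requires: using System.IO; using System.Runtime.Serialization.Formatters.Binary;
    using (Stream testFileStream = File.Create(filename))
    {
        BinaryFormatter serializer = new BinaryFormatter();
        serializer.Serialize(testFileStream, this);
    }   // the stream is closed here even if Serialize throws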
Adrian773
  • Emphasis on "potential" as well guys. – Adrian773 Mar 19 '15 at 21:53
  • According to [this article](http://www.thelowlyprogrammer.com/2010/02/straining-limits-of-c.html) the actual limit is around 6M items – stuartd Mar 19 '15 at 22:15
  • You might just want to start overriding and implementing serialization yourself. The metadata stored in an object serialized by BinaryFormatter is quite large. private class Widget { private int _f = 3; } takes 140 bytes to store. If I make the namespace longer, more bytes. Change _f to _field: 4 more bytes. This gives you the ability to deserialize even if you re-order your fields and add some fields in a later version of the DLL, but maybe you don't need those features and just want to store the integer value of the widget in 4 bytes. – user922020 Mar 19 '15 at 22:50
  • @user922020 Are you suggesting overriding with the intention of reducing the amount of data that is actually going to be stored or to increase the amount of data that can be stored? Most of my stored data presently is in a dictionary and a list of nodes but the dictionary is going to be changed to a number of explicitly set properties in the near future. I do not believe anything can really be cut out in regards to string key or property name and the priority should be on quick deserialization as well. – Adrian773 Mar 22 '15 at 21:58
  • Reducing. But it really depends what the system considers an "element". I'll attempt an answer but I don't have your data set to test on. – user922020 Mar 24 '15 at 15:47

1 Answer


People are downvoting your question because it isn't specific enough to answer.
But shedding some light on BinaryFormatter could help.

So what you probably want to do is avoid serialization altogether and just write your own read and write routines, as discussed in the question "BinaryFormatter alternatives".

If you avoid BinaryFormatter completely, there won't be any internal element counts to cause exceptions. But BinaryFormatter does protect itself against infinite loops on cyclic references, among other things, and you will have to handle those yourself if you are serializing some kind of network node graph. It is a lot of work.
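As a rough idea of what "your own read and your own write" could look like for a cyclic graph, here is a minimal sketch. It assumes a hypothetical `GraphNode` class with a single `Value` and a `Links` list (the real nodes have far more members), and it sidesteps cycles by writing each reference as the integer index of its target. `GraphFile` and its file layout are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical node type for illustration; the real class has 10-30 members.
public class GraphNode
{
    public int Value;
    public List<GraphNode> Links = new List<GraphNode>();
}

public static class GraphFile
{
    public static void Write(Stream output, IList<GraphNode> nodes)
    {
        // Give every node an index so references become plain integers.
        var index = new Dictionary<GraphNode, int>();
        for (int i = 0; i < nodes.Count; i++) index[nodes[i]] = i;

        var writer = new BinaryWriter(output);
        writer.Write(nodes.Count);
        foreach (var node in nodes)
        {
            writer.Write(node.Value);
            // Links that point outside the saved section are skipped.
            var kept = node.Links.FindAll(l => l != null && index.ContainsKey(l));
            writer.Write(kept.Count);
            foreach (var link in kept) writer.Write(index[link]);
        }
        writer.Flush();   // the caller owns the stream, so it is left open
    }

    public static List<GraphNode> Read(Stream input)
    {
        var reader = new BinaryReader(input);
        int count = reader.ReadInt32();

        // Create all nodes first so that every link index can be resolved.
        var nodes = new List<GraphNode>(count);
        for (int i = 0; i < count; i++) nodes.Add(new GraphNode());

        foreach (var node in nodes)
        {
            node.Value = reader.ReadInt32();
            int linkCount = reader.ReadInt32();
            for (int j = 0; j < linkCount; j++) node.Links.Add(nodes[reader.ReadInt32()]);
        }
        return nodes;
    }
}
```

Because references are stored as indices into the saved list, a stray link to the rest of the network is dropped on write instead of dragging millions of extra objects into the file.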

Before we go any further: BinaryFormatter serializes private fields as well as public ones. Is there any chance that you stored something massive in a private field and didn't count it among the 10-30 variables per node?
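A quick way to look for a forgotten private field is to list a type's instance fields with reflection. This is only a sketch: `FieldProbe`, `SampleBase` and `SampleNode` are hypothetical names, and the listing approximates what the formatter walks without filtering out fields marked [NonSerialized].

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical types for illustration only.
public class SampleBase
{
    private byte[] _cache = new byte[0];   // easy to forget: private and declared on the base class
}

public class SampleNode : SampleBase
{
    public int Id = 0;
    private string _label = "";
}

public static class FieldProbe
{
    // Lists instance fields declared on a type and on its base types, private ones included.
    // Assumption: fields marked [NonSerialized] are NOT filtered out here.
    public static List<string> InstanceFields(Type type)
    {
        var names = new List<string>();
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;

        // DeclaredOnly plus the BaseType walk is needed to see private fields of base classes.
        for (Type t = type; t != null && t != typeof(object); t = t.BaseType)
            foreach (FieldInfo field in t.GetFields(flags))
                names.Add(t.Name + "." + field.Name + " : " + field.FieldType.Name);

        return names;
    }
}
```

Calling `FieldProbe.InstanceFields(typeof(SampleNode))` lists `_cache` from the base class alongside `Id` and `_label`; a large collection or a reference to the parent network showing up in such a list would be worth investigating.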

Why avoid serialization? Looking more deeply into BinaryFormatter, we see that it carries a lot of overhead: it can read data written by multiple versions of your DLL, and it stores the private field names in case you re-order the fields in your class. It has features. If you don't need those features and you want fast performance, avoid it.

Example.

    [Serializable]
    class Widget2
    {
        private string _fieldWithMuchLongerName = "XXX";
    }

If you just serialize it to a memory stream and then view the bytes of it you get...

   "\0\0\0\0ÿÿÿÿ\0\0\0\0\0\0\0\f\0\0\0@SOAnswers, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null\0\0\0SOAnswers.Serialzation1+Widget2\0\0\0_fieldWithMuchLongerName\0\0\0\0\0\0XXX\v"

That's a whole lot of bytes to store the string "XXX". The binary formatter gets more efficient when you store many objects of the same type, as in a List<>, because the type description is not repeated for each one. But it still has features, which means overhead.
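For comparison, here is a small sketch that measures what a hand-written `BinaryWriter` emits for the same string. `SizeProbe` is a hypothetical helper name; `BinaryWriter.Write(string)` writes a length prefix followed by the UTF-8 bytes.

```csharp
using System;
using System.IO;

public static class SizeProbe
{
    // Returns the number of bytes BinaryWriter emits for a single string.
    public static long BytesForString(string text)
    {
        using (var buffer = new MemoryStream())
        using (var writer = new BinaryWriter(buffer))
        {
            writer.Write(text);   // length prefix followed by the UTF-8 bytes
            writer.Flush();
            return buffer.Length;
        }
    }
}
```

`SizeProbe.BytesForString("XXX")` returns 4: one byte for the length prefix and three for the characters, with no assembly, type, or field names stored.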

user922020
  • If I could +1 again for letting me know why they didn't like my question I would. – Adrian773 Mar 24 '15 at 20:00
  • Most of the time they want to reproduce the error, which means giving them a lot more source code: at least the class being serialized and some kind of generator to create eleven thousand of them. I've spent more time anonymizing and disconnecting my code for StackOverflow questions than I like to admit. – user922020 Mar 25 '15 at 17:41
  • My first thought was that your node graph must be tricking BinaryFormatter into an infinite loop, but it is coded to detect loops. Not sure what it would take to trick it. – user922020 Mar 25 '15 at 17:42