
It says in the documentation that the output of sys.getsizeof() is in bytes. I'm trying to store a data structure that is a dictionary of class instances and lists. I called sys.getsizeof() on this dictionary of class instances and it returned 3352 bytes. I'm serializing it with dill so I can load it later, but it's taking a really, really long time.

The file size is already 260 MB, which is much larger than the 3352 bytes reported by sys.getsizeof(). Does anyone know why the values are so different, and why it is taking so long to store?

Is there a more efficient way to store objects like this when running on a MacBook Air with 4 GB of memory?

dill is an incredible tool. I'm not sure if there are any parameters I can tweak to help with my low-memory issue. I know pickle has protocol=2, but it doesn't seem to store the environment as well as dill does.

import sys, dill

sys.getsizeof(D_storage_Data)  # Output is 3352
dill.dump(D_storage_Data, open("storage.obj", "wb"))
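
For reference, this is the kind of parameter tweaking I mean. As far as I understand, dill.dump mirrors pickle.dump and also accepts a protocol argument; the file names here are just placeholders:

import pickle
import dill

# A sketch of requesting a newer binary pickle protocol explicitly.
with open("storage_pickle.obj", "wb") as f:
    pickle.dump(D_storage_Data, f, protocol=2)

with open("storage_dill.obj", "wb") as f:
    dill.dump(D_storage_Data, f, protocol=pickle.HIGHEST_PROTOCOL)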
O.rka
  • Please post your code. – RobertB Oct 16 '15 at 22:13
  • Just posted it now. I'm using dill.dump – O.rka Oct 16 '15 at 22:20
  • I was thinking more about the structure of D_storage_Data. You are asking for a more efficient way to store a data structure, but you are not saying what the data structure is. Does your data structure contain any functions, object references, or classes? Do you perhaps have any self-referential recursion in your structure? It sounds like more and more data is being dynamically generated during serialization. – RobertB Oct 16 '15 at 22:33
  • "I did sys.getsizeof() on a dictionary of class instances and it was 3352 bytes" This is my data structure. I'll make it more explicit in the question. – O.rka Oct 16 '15 at 23:55

2 Answers


Watch this:

>>> import sys, pickle
>>> x = [i for i in range(255)]
>>> sys.getsizeof(x)
2216
>>> d = {1: x}
>>> sys.getsizeof(d)
288
>>> s = pickle.dumps(d)  # dill is similar, I just don't have it installed on this computer
>>> sys.getsizeof(s)
557

sys.getsizeof() is shallow: the size reported for 'd' is just the size of the dict object itself (its internal structure plus the pointers to its keys and values). It does not include the size of 'x' at all.

When you serialize 'd', however, the serializer has to write out both 'd' and 'x' so that a meaningful dict can be reconstructed later. That is why your file is so much larger than the number reported by sys.getsizeof(). And as you can see, the serializer actually does a good job of packing the data up.
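
If you want a number that reflects the whole structure, you have to walk the object graph yourself (or use a third-party tool like pympler's asizeof). A rough, hypothetical deep_getsizeof() sketch that only handles a few container types:

import sys

def deep_getsizeof(obj, seen=None):
    # Hypothetical helper: recursively sums getsizeof over containers and
    # instance __dict__s, skipping objects already visited (handles cycles).
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    elif hasattr(obj, '__dict__'):
        size += deep_getsizeof(vars(obj), seen)
    return size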

RobertB

I'm the dill author. See my comment here: If Dill file is too large for RAM is there another way it can be loaded. In short, the answer is that it depends on what you are pickling… and if it's class instances, the answer is yes: try the byref setting. Also, if you are looking to store a dict of objects, you might want to map your dict to a directory of files by using klepto -- that way you can dump and load individual entries of the dict separately and still work through a dict API.
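
A rough sketch of that klepto approach (the directory name and key are placeholders, and the exact constructor signature may vary between klepto versions):

from klepto.archives import dir_archive

# Map the dict to a directory on disk, one serialized file per key.
archive = dir_archive('storage_dir', D_storage_Data, serialized=True)
archive.dump()              # write every entry out to its own file

# Later, pull back just the entry you need instead of the whole structure.
archive.load('some_key')    # 'some_key' is a placeholder key name
item = archive['some_key']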

So, especially when using dill, and especially in an ipynb, check out dill.settings… Serialization (with dill, pickle, or otherwise) recursively pulls objects into the pickle, and so can often pull in all of globals. Use dill.settings to change what is stored by reference and what is stored by pickling.
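
For example, a minimal sketch of the byref setting, assuming a dict named D_storage_Data as in the question (check dill.settings on your install for the full set of knobs):

import dill

# Store class definitions by reference instead of by value; this can shrink
# the pickle considerably when the classes are importable at load time.
dill.settings['byref'] = True

with open("storage.obj", "wb") as f:
    dill.dump(D_storage_Data, f)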

Mike McKerns
  • I ended up running it on my university's cluster and it worked perfectly in no time at all. It's weird how large my files were getting compared to when I ran it on the cluster. I was running it in the IPython notebook and it seemed like there was a feedback loop. – O.rka Oct 17 '15 at 02:34
  • This will be really useful for when I use it again. Thanks a lot, Mike! This module is awesome. – O.rka Oct 17 '15 at 02:35