4

I have a very large python shelve object (6GB on disk). I want to be able to move it to another machine, and since shelves are not portable, I wanted to cPickle it. To do that, I first have to convert it to a dict.

For some reason, when I do `dict(myShelf)`, the IPython process spikes up to 32GB of memory (all my machine has) and then seems to hang (or maybe just take a really long time).

Can someone explain this? And perhaps offer a potential workaround?

edit: using Python 2.7

pocketfullofcheese

1 Answer

5

From my experience I'd expect pickling to be even more of a memory-hog than what you've done so far. However, creating a dict loads every key and value in the shelf into memory at once, and you shouldn't assume that because your shelf is 6GB on disk, it's only 6GB in memory. For example:

>>> import sys, pickle
>>> sys.getsizeof(1)
24
>>> len(pickle.dumps(1))
4
>>> len(pickle.dumps(1, -1))
5

So, on my machine, a very small integer is 5-6 times bigger as an in-memory Python int object than it is once pickled.
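To see the same effect in aggregate, you can compare a mapping's in-memory footprint against its pickled size. Here's a rough sketch (the mapping and its size are illustrative only; exact numbers vary by platform):

import sys, pickle

# Build a sample mapping, total up the sizes of the hash table and
# its contents, then compare with the pickled representation.
d = dict((i, str(i)) for i in xrange(100000))
table = sys.getsizeof(d)  # the dict's hash table alone, excluding keys/values
contents = sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.iteritems())
pickled = len(pickle.dumps(d, -1))
print 'table: %d, contents: %d, pickled: %d' % (table, contents, pickled)

The in-memory total (table plus contents) typically comes out several times larger than the pickled byte string, for the same reason as the single-int example above.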

As for the workaround: you can write more than one pickled object to a file. So don't convert the shelf to a dict; just write a long sequence of keys and values to your file, then read an equally long sequence of keys and values on the other side to put into your new shelf. That way you only need one key/value pair in memory at a time. Something like this:

Write:

import pickle  # or: import cPickle as pickle, for speed on Python 2

# Stream one pickled (key, value) pair at a time, prefixed by a count
# so the reader knows how many pairs to expect.
with open('myshelf.pkl', 'wb') as outfile:
    pickle.dump(len(myShelf), outfile)
    for p in myShelf.iteritems():
        pickle.dump(p, outfile)

Read:

import pickle

# On the destination machine, myShelf is a freshly opened shelf
# (e.g. from shelve.open); read back exactly as many pairs as were written.
with open('myshelf.pkl', 'rb') as infile:
    for _ in xrange(pickle.load(infile)):
        k, v = pickle.load(infile)
        myShelf[k] = v

I think you don't actually need to store the length: you could just keep reading until pickle.load throws an exception indicating it has run out of file.
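That variant might look like this (a sketch: pickle.load raises EOFError once the file is exhausted):

with open('myshelf.pkl', 'rb') as infile:
    while True:
        try:
            k, v = pickle.load(infile)
        except EOFError:  # no more pickled pairs left in the file
            break
        myShelf[k] = v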

Steve Jessop
  • do you mean by iterating through the shelf? – pocketfullofcheese Jan 28 '15 at 23:54
  • I mean `for k, v in myShelf.items():` (or use `iteritems` in Python 2). – Steve Jessop Jan 29 '15 at 00:00
  • +1 When converting to a `dict`, you add memory overhead in the form of internal book-keeping of the data structure. This is a very good idea. – salezica Jan 29 '15 at 00:10
  • @uʍopǝpısdn: more to the point, when converting to `dict` everything needs to be in memory at once. Even if there's enough swap to cope, it's not pretty. – Steve Jessop Jan 29 '15 at 00:11
  • Incredible that it would be so much overhead (more than 4 times the content) for a dict. It's supposed to just be a hash table, right? – pocketfullofcheese Jan 29 '15 at 00:11
  • @pocketfullofcheese: you shouldn't assume because your data is 6GB on disk, that it's 6GB in memory too. Pickled objects often require less storage than the memory they occupy in full un-pickled form. There's overhead everywhere, in every Python object, not just the `dict` itself. – Steve Jessop Jan 29 '15 at 00:15
  • It's not just the hashes. As Steve says, the hash-table for the basic `dict` data structure is not meant to store so many entries. It has no offloading mechanism, it will hold every key-value pair in memory simultaneously – salezica Jan 29 '15 at 00:15