
I have produced a list of dictionaries of about 8,100,000 bytes, with 9 million+ elements. Each element is a dictionary of 32 key-value pairs, though the same set of keys is used in each element.

I wanted to save it for future analysis. I have tried dill.dump, but it took so long (more than 1 hour) that I had to interrupt the kernel. This is supposed to be fast and easy, right?

Here is what I have tried:

import dill
output_file=open('result_list', 'wb')
dill.dump(result_list, output_file)
output_file.close()

I also tried pickle with bz2:

import bz2
import pickle
output_file=bz2.BZ2File('result_list', 'w')
pickle.dump(result_list, output_file)
output_file.close()

But I ran into a memory error.

Any tips on making this feasible and less time consuming? Thanks!

xiaoshir
  • try [pickle](https://docs.python.org/2/library/pickle.html) dump – Yugandhar Chaudhari Feb 08 '19 at 12:03
  • just tried and updated the question – xiaoshir Feb 08 '19 at 12:17
  • see these [benchmarks](https://blog.hartleybrody.com/python-serialize/) – Yugandhar Chaudhari Feb 08 '19 at 12:21
  • What is the content of your dictionary? Maybe you have large keys and/or values. Dill is also capable of complex serialisations of classes and functions, so if you're serialising objects that are instances of user-defined classes, that could also slow things down. – Dunes Feb 08 '19 at 14:18
  • ... or if it's something that's built on large `numpy` arrays... there's a lot of objects that take a while. `dill` has `dill.settings`, which allows you to pick the flavor of serialization, and that can often make a difference on how much is actually stored. – Mike McKerns Feb 08 '19 at 17:44

1 Answer


I'm the dill author. You may want to try klepto for this case. dill (actually any serializer) will treat the entire dict as a single object... and something of that size, you might want to treat it more like a database of entries... which is what klepto can do. The fastest approach is probably to use the archive that treats each entry as a different file in a single directory on disk:

>>> import klepto
>>> x = range(10000)
>>> d = dict(zip(x,x))
>>> a = klepto.archives.dir_archive('foo', d)
>>> a.dump()

The above makes a directory with 10000 subdirectories, with one entry in each. Keys and values are both stored. Note that you can tweak the serialization method a bit as well, so check the docs to see how to do that for your custom case.
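A rough sketch of reading the archive back into memory later (the exact load options may vary across klepto versions, so check the docs):

>>> import klepto
>>> b = klepto.archives.dir_archive('foo')   # same directory name as above
>>> b.load()          # pull all entries from disk into the local cache
>>> b[0]
0
>>> # b.load(0, 1, 2) would pull in just the selected keys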

Alternatively, you could iterate over the dict and serialize each of the entries with dump inside a parallel map from a multiprocess.Pool.
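Here's a minimal sketch of that idea (the directory layout and helper name are just illustrative, not part of dill or multiprocess):

import os
import dill
from multiprocess import Pool

d = dict(zip(range(10000), range(10000)))   # stand-in for the real dict

def dump_entry(item):
    key, value = item
    # one small file per entry; assumes the key is safe to use in a filename
    with open(os.path.join('entries', '%s.pkl' % key), 'wb') as f:
        dill.dump(value, f)

if __name__ == '__main__':
    if not os.path.isdir('entries'):
        os.makedirs('entries')
    pool = Pool()
    pool.map(dump_entry, d.items())
    pool.close()
    pool.join()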

(Side note, I'm the author of multiprocess and klepto as well).

UPDATE: as the question was changed from serializing a huge dict to serializing a huge list of small dicts... this changes the answer.

klepto is built for large dict-like structures, so it's probably not what you want here. You may want to try dask, which is built for large array-like structures.
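For instance, a rough sketch with dask.bag (the partition count and file pattern are placeholders, and `result_list` stands in for the real 9-million-element list):

import json
import dask.bag as db

result_list = [{'a': i, 'b': 2 * i} for i in range(1000)]   # toy stand-in

bag = db.from_sequence(result_list, npartitions=16)
# one JSON record per line, spread across several compressed files
bag.map(json.dumps).to_textfiles('records-*.json.gz')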

I think you could also iterate over the list, serializing each of the list entries individually... and as long as you loaded them back in the same order, you'd be able to reconstitute your results. You could do something like store the position along with the value, so that you can restore the list and then sort if the entries come back out of order.
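Something along these lines (purely illustrative; it streams `(position, entry)` pairs into one file and sorts on reload):

import pickle

result_list = [{'a': i, 'b': 2 * i} for i in range(100)]   # toy stand-in

# serialize each entry separately, tagged with its position in the list
with open('entries.pkl', 'wb') as f:
    for i, entry in enumerate(result_list):
        pickle.dump((i, entry), f)

# reload one record at a time, then restore the original order
records = []
with open('entries.pkl', 'rb') as f:
    while True:
        try:
            records.append(pickle.load(f))
        except EOFError:
            break
records.sort(key=lambda pair: pair[0])
restored = [entry for _, entry in records]
assert restored == result_list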

I'd also ask you to think about whether your results could be restructured into a better form...

Mike McKerns
  • Thanks a lot for the answer. After a closer look at my data, it turns out to be a list of dictionaries (rather than a single dictionary), with 9 million+ elements. Each element is a dictionary of 32 key-value pairs, though the same set of keys is used in each element. It seems that `klepto.archives.dir_archive` only works with a dict? I have updated my question too. – xiaoshir Feb 11 '19 at 10:36
  • I managed to transform my data into a pandas dataframe with 9 million rows * 32 columns, and with `df.to_pickle("result_dict.pkl")` it was super fast (see the sketch after these comments). Thanks for all the help. – xiaoshir Feb 12 '19 at 16:30
  • Hi @MikeMcKerns! Can `dill` be used to solve this problem?: https://stackoverflow.com/q/69430747/6907424 – hafiz031 Oct 05 '21 at 06:35
  • @hafiz031: as a first approach, you can try replacing `multiprocessing` with `multiprocess`. The latter uses `dill`. – Mike McKerns Oct 05 '21 at 12:55
  • Thanks, could you elaborate? I'm not quite sure how to maintain a dictionary with `dill` such that I can hold non-picklable objects in it and share it across processes. Question link: https://stackoverflow.com/q/69430747/6907424 – hafiz031 Oct 05 '21 at 14:52
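A sketch of the approach from the comment above (column names are illustrative; `result_list` stands in for the real 9-million-element list):

import pandas as pd

result_list = [{'a': i, 'b': 2 * i} for i in range(1000)]   # toy stand-in

df = pd.DataFrame(result_list)      # 9 million rows x 32 columns in the real case
df.to_pickle("result_dict.pkl")     # fast single-shot serialization

# later:
df2 = pd.read_pickle("result_dict.pkl")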