
I have a big (several gigabytes) nested dictionary with this structure:

```python
{
  string1: {string1_1: int1_1, string1_2: int1_2, ...},
  string2: {string2_1: int2_1, string2_2: int2_2, ...},
  ...
}
```

It's a kind of word co-occurrence count over a big text corpus, so the number of keys in the inner dicts varies.
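
For concreteness, a minimal sketch of how such a structure might be built; the tiny corpus, the ±2-word window, and all names here are hypothetical:

```python
from collections import defaultdict

# Hypothetical miniature corpus; the real one is several gigabytes of text.
corpus = ["the cat sat on the mat", "the dog sat on the log"]

cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # Count co-occurrences within a +/-2-word window (window size is an assumption).
        for other in words[max(0, i - 2):i] + words[i + 1:i + 3]:
            cooc[w][other] += 1

# Convert to plain dicts (a lambda default_factory would not pickle).
cooc = {w: dict(inner) for w, inner in cooc.items()}
```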

I am trying to find the fastest way to save this structure to the hard drive for later reuse. `pickle`/`cPickle.dump` is impossibly slow. `msgpack.pack` is better, but even so it is faster for me to recalculate the whole dict from the raw data than to dump and load it.
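
For reference, the two dump paths tried so far look roughly like this; a sketch, assuming the dict is named `cooc` and the file names are placeholders:

```python
import pickle
import msgpack  # pip install msgpack

# pickle: protocol choice matters; Python 2's default is the slow ASCII protocol 0.
with open("cooc.pkl", "wb") as f:
    pickle.dump(cooc, f, protocol=pickle.HIGHEST_PROTOCOL)

# msgpack: packs the nested dict straight to bytes (keys/values must be msgpack-serializable).
with open("cooc.msgpack", "wb") as f:
    f.write(msgpack.packb(cooc))
```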

Does anybody have experience serializing such huge dicts? Any tips/tricks and libraries are appreciated.

  • When you try to pickle, which `protocol` are you using? Refer to the docstring for `pickle.dump()`. – jeschwar Mar 22 '18 at 14:13
  • Since the processing is likely I/O bound, perhaps compressing the data before (or ideally as) it's being written would help, e.g. with the `gzip` module (see the first sketch after these comments). – martineau Mar 22 '18 at 14:15
  • Another possibility would be to replace the outer dictionary with a "shelf" created by the `shelve` module. That way the data structure would effectively already be in a file, so there would be nothing to do but `sync()` and `close()` it before the program ends. A "shelf" is a persistent, dictionary-like object; see the docs (and the second sketch after these comments). – martineau Mar 22 '18 at 14:24
  • @jeschwar tried the default and the latest; both are slow. I just don't get why dumping part of memory to the hard drive takes so long. – Alexey Trofimov Mar 22 '18 at 14:26
  • `pickle` does a great deal more than just dump memory. It has to maintain internal object references and references to classes that are in your program but not in the object you are pickling. – BoarGules Mar 22 '18 at 14:30
  • So I tried compressing the pickle dump before writing it to file, but since compression algorithms run in superlinear time while I/O, though slow, is linear, it turns out we save a lot of disk space but it is slower by a factor of 1.5 to 2, depending on the compression level. – Olivier Melançon Mar 22 '18 at 14:51
  • @AlexeyTrofimov have you considered using HDF5? I regularly use this with the `pandas` implementation via `pytables`. Some information and other formats can be found in the `pandas` docs [here](http://pandas.pydata.org/pandas-docs/stable/io.html) (see the last sketch after these comments). – jeschwar Mar 22 '18 at 15:36
  • @jeschwar unfortunately hdf5 is 3x slower for dumping dicts than pickle and 6x slower than msgpack. – Alexey Trofimov Mar 22 '18 at 16:54
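
A minimal sketch of martineau's `gzip` suggestion, compressing as the pickle is written; the file name and `compresslevel` are assumptions, and as Olivier Melançon observes, higher levels cost time:

```python
import gzip
import pickle

# Compress while writing: pickle streams straight into the gzip file object.
# compresslevel=1 favors speed; the default of 9 saves more disk but is slower.
with gzip.open("cooc.pkl.gz", "wb", compresslevel=1) as f:
    pickle.dump(cooc, f, protocol=pickle.HIGHEST_PROTOCOL)

# Reading back is symmetric.
with gzip.open("cooc.pkl.gz", "rb") as f:
    cooc = pickle.load(f)
```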
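And a sketch of the `shelve` idea: the outer dict becomes a file-backed mapping, so each inner dict is pickled separately on assignment and single entries can be read back without loading everything (the file name is a placeholder):

```python
import shelve

# Write: each inner dict is pickled individually as it is assigned.
with shelve.open("cooc_shelf") as db:
    for word, counts in cooc.items():
        db[word] = counts

# Read back one entry without loading the whole structure.
with shelve.open("cooc_shelf") as db:
    counts_for_the = db["the"]
```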
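For completeness, the HDF5 route jeschwar mentions might look like the following: flatten the nested dict into a long-format DataFrame and let `pandas`/`pytables` store it. Column names and the file name are assumptions, and per the follow-up comment this turned out slower than pickle or msgpack for this workload:

```python
import pandas as pd  # HDF5 support also requires the `tables` package

# Flatten {word: {neighbor: count}} into a long-format table.
rows = [(w1, w2, n) for w1, inner in cooc.items() for w2, n in inner.items()]
df = pd.DataFrame(rows, columns=["word1", "word2", "count"])

df.to_hdf("cooc.h5", key="cooc", mode="w")
df2 = pd.read_hdf("cooc.h5", key="cooc")
```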

0 Answers