
I need a memory-efficient int-int dict in Python that would support the following operations in O(log n) time:

d[k] = v  # replace if present
v = d[k]  # None or a negative number if not present

I need to hold ~250M pairs, so it really has to be tight.

Do you happen to know a suitable implementation (Python 2.7)?

EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!


To rephrase. Here's a trivial int-int dictionary with 1M pairs:

>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
...     d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
... 
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 25165960  51  25165960  51 dict (no owner)
     1 1999521 100 23994252  49  49160212 100 int

On average, a pair of integers uses 49 bytes.

Here's an array of 2M integers:

>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
...     a.append(random.randint(0, sys.maxint))
... 
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   7  8000028 100   8000028 100 array.array

On average, a pair of integers uses 8 bytes.

I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of int-int dictionary that uses considerably less than 49 bytes/pair?
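For illustration, here is a minimal sketch of the sorted parallel-array approach the question is circling around (my own sketch, not a tested 250M-pair solution): two `array.array('i')` sequences kept sorted by key give roughly 8 bytes/pair for 32-bit values, with O(log n) lookups but O(n) insertions due to shifting.

```python
import array
from bisect import bisect_left

class IntIntDict(object):
    """Sorted parallel arrays of C ints: ~8 bytes/pair for 32-bit values.
    Lookups are O(log n); insertions are O(n) because of shifting."""
    def __init__(self):
        self._keys = array.array('i')
        self._values = array.array('i')

    def __setitem__(self, key, value):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # replace if present
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key, default=-1):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default
```

This meets the memory target but not the O(log n) insertion requirement, which is exactly the trade-off discussed in the comments below.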

Bolo
  • Perhaps I am not thinking straight, but I don't see how your proposed implementation (with keys at even locations of the array and values at odd locations) could be *O(log n)* for both insertions and lookups. – Craig McQueen Oct 26 '10 at 12:41
  • @Craig Oh, you're right. In my implementation one cannot do lookups in _O(log n)_ (for keys other than the smallest). – Bolo Oct 26 '10 at 13:15
  • How does the 250M pairs relate to the range of key-values? Are there 250M possible keys and 250M actual pairs so the array is 100% dense? – hughdbrown Oct 26 '10 at 15:13
  • @hughdbrown The keys are hashes of strings, so there are 4G possible keys. There will be ~500 different dictionaries of varying sizes (1K to 20M) with ~125M pairs in total. FYI, the pairs are (page id, page title hash) from all the language editions of Wikipedia indexed both ways. – Bolo Oct 26 '10 at 20:45
  • If they are hashes, isn't a hashtable the appropriate data structure here? Do you get to chose your hashing function? Indexed both ways changes the question a bit. – Paul McMillan Oct 27 '10 at 07:40
  • Hi, I've rephrased my question. I hope it's a bit clearer now. – Bolo Oct 27 '10 at 08:32
  • Actually, I've noticed that in my particular case I can rearrange the operations in such a way that all the writes occur before the first read. Thanks to this, I can: 1) append all the key-value pairs to an `array`; 2) sort by keys; 3) access the values using binary search. However, I wonder how to achieve a memory-efficient int-int dictionary in a general case, so the question still stands. – Bolo Oct 27 '10 at 08:40
  • You may also be interested in a Judy-array solution: http://stackoverflow.com/questions/18041848/efficient-way-to-hold-and-process-a-big-dict-in-memory-in-python/18042374#18042374 – Jason Xu Aug 07 '13 at 02:32

6 Answers


You could use the `IIBTree` from Zope.

John La Rooy
  • Thanks, at first I didn't realize that `IIBtree` is using "primitive" (in the Java sense of the term) ints. It's a useful structure! Unfortunately it forces a dependency on Zope, which is rather heavy. – Bolo Dec 29 '10 at 21:51
  • Thanks! I have a similar requirement, and IIBTree is just the perfect data structure! Also, I installed BTrees-4.0.5 and it only pulled in persistent-4.0.6 as dependency. – sayap Mar 29 '13 at 08:37

I don't know if this is a one-shot solution, or part of an ongoing project, but if it's the former, is throwing more ram at it cheaper than the necessary developer time to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at 15GB, which would fit easily enough into most desktop boxes.

I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with the library to tell you exactly where to look.

http://docs.scipy.org/doc/numpy/reference/

You might also find some useful ideas in this thread: Memory Efficient Alternatives to Python Dictionaries
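The answer doesn't give a concrete NumPy recipe, but one plausible sketch (my assumption, not from the answer) is a build-once, query-many structure: sort keys and values into parallel `int32` arrays (~8 bytes/pair) and look keys up with `np.searchsorted`.

```python
import numpy as np

def build(pairs):
    # pairs: iterable of (key, value); returns parallel int32 arrays sorted by key
    keys, values = map(np.asarray, zip(*pairs))
    order = np.argsort(keys, kind='mergesort')
    return keys[order].astype(np.int32), values[order].astype(np.int32)

def lookup(keys, values, k, default=-1):
    # binary search over the sorted key array, O(log n)
    i = np.searchsorted(keys, k)
    if i < len(keys) and keys[i] == k:
        return int(values[i])
    return default
```

This only works when all writes happen before the first read, which, per the question's comments, happens to fit the author's workload.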

Paul McMillan
  • I agree. I also think that Numpy is among the most probable solutions for (memory) efficient numerical arrays. – extraneon Oct 26 '10 at 19:02
  • Thanks for your input! It's a repetitive task, and it's done "after hours", with $0 budget, on a box which I cannot upgrade. Besides, I guess it's better to implement a memory-efficient, reusable data structure once than to throw money into hardware every time one runs out of memory. From my limited experience, SciPy/NumPy focuses on floating-point numbers, and is of little use here. The referenced thread is interesting indeed, thanks for sharing! – Bolo Dec 29 '10 at 22:09
  • Having a memory efficient int-int dict is a generally useful thing that _should_ exist. It _should_ be a matter of googling or asking on stackoverflow which library is most suitable. This _should_ be much easier than updating hardware. – Ant6n Apr 17 '15 at 22:48

8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous then either you'd waste a lot of space between the keys by using an array representation (as well as needing some sort of dead value to indicate a null key), or you'd need to maintain a separate index to key/value pairs which by definition would exceed your 8 bytes per pair (even if only by a small amount).

I suggest you go with your array method; the best approach will depend on the nature of the keys, I expect.

Kylotan
  • Thanks for your input! My trick was to split the key domain into a number of arrays, and keep the key-value pairs sorted within each array. That way both reads (binary search) and writes (shift and insert) are relatively cheap. – Bolo Dec 29 '10 at 22:17

How about a Judy array, if you're mapping from ints? It is a kind of sparse array that uses about 1/4 of the dictionary implementation's space.

Judy:

$ cat j.py ; time python j.py 
import judy, random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = judy.JudyIntObjectMap()
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)

print h.heap()
Partition of a set of 4000004 objects. Total size = 96000624 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 4000001 100 96000024 100  96000024 100 int
     1      1   0      448   0  96000472 100 types.FrameType
     2      1   0       88   0  96000560 100 __builtin__.weakref
     3      1   0       64   0  96000624 100 __builtin__.PyJudyIntObjectMap

real    1m9.231s
user    1m8.248s
sys     0m0.381s

Dictionary:

$ cat d.py ; time python d.py   
import random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = {}
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)

print h.heap()
Partition of a set of 8000003 objects. Total size = 393327344 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 201326872  51 201326872  51 dict (no owner)
     1 8000001 100 192000024  49 393326896 100 int
     2      1   0      448   0 393327344 100 types.FrameType

real    1m8.129s
user    1m6.947s
sys     0m0.559s

~1/4th the space:

$ echo 96000624 / 393327344 | bc -l
.24407309958089260125

(I'm using 64-bit Python, by the way, so my baseline numbers may be inflated due to 64-bit pointers.)

rrauenza

Looking at your data above, that's not 49 bytes per pair for the dictionary itself, it's 25. The other 24 bytes per entry are the int objects themselves. So you need something significantly smaller than 25 bytes per entry, unless you are also going to reimplement the int objects, which is possible for the key hashes at least. Or implement it in C, where you can skip the objects completely (this is what Zope's IIBTree does, as mentioned above).

To be honest, the Python dictionary is highly tuned in various ways; it will not be easy to beat. But good luck.

Lennart Regebro
  • Thanks for your valuable answer. True, the 24 bytes are for the two int objects – but with `arrays` you can skip the objects and trim that size down to 8 bytes. I didn't know that `IIBTree` stores "primitive" ints as well: neither the documentation nor gnibbler's answer mentioned that. Thanks for this clarification! Although the Python dictionary is indeed well-tuned, what I need is an implementation which is maximally optimized for space, at the expense of time (ideally, for the sake of portability: written in Python, not C, and usable without a big external library like Zope). – Bolo Dec 29 '10 at 21:48
  • @Bolo: It's usable without Zope, but not without ZODB, of which it is a part. It's much faster, but seems to use slightly more memory than your solution, from what I can gather of testing (the guppy heap printout is useless here, probably because the ints are allocated in C). It's of course made to store BTrees in the ZODB and is also used by the ZODB itself. – Lennart Regebro Dec 29 '10 at 23:12

I have implemented my own int-int dictionary, available here (BSD license). In short, I use `array.array('i')` to store key-value pairs sorted by keys. In fact, instead of one large array, I keep a dictionary of smaller arrays (a key-value pair is stored in the (key // 65536)-th array) in order to speed up shifting during insertion and binary search during retrieval. Each array stores the keys and values in the following way:

key0 value0 key1 value1 key2 value2 ...

Actually, it is not only an int-int dictionary, but a general object-int dictionary with objects reduced to their hashes. Thus, the hash-int dictionary can be used as a cache of some persistently stored dictionary.

There are three possible strategies for handling "key collisions", that is, attempts to assign a different value to an already-used key. The default strategy simply allows overwriting. The "deleting" strategy removes the key and marks it as colliding, so that any further attempt to assign it a value has no effect. The "shouting" strategy throws an exception on any overwrite attempt and on any further access to a colliding key.

Please see my answer to a related question for a differently worded description of my approach.
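A rough sketch of the bucketing scheme described above (my own reconstruction, not the actual BSD-licensed code): pairs interleaved as `key0 value0 key1 value1 ...` in `array.array('i')` buckets chosen by `key >> 16`, with binary search over the even (key) slots.

```python
import array

class BucketedIntIntDict(object):
    """Pairs stored interleaved (key0, value0, key1, value1, ...) in
    array.array('i') buckets selected by key >> 16; keys kept sorted
    within each bucket so shifts and searches stay short."""
    def __init__(self):
        self._buckets = {}

    def _find(self, a, key):
        # binary search over the even (key) slots; returns a pair index
        lo, hi = 0, len(a) // 2
        while lo < hi:
            mid = (lo + hi) // 2
            if a[2 * mid] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def __setitem__(self, key, value):
        a = self._buckets.setdefault(key >> 16, array.array('i'))
        i = self._find(a, key)
        if 2 * i < len(a) and a[2 * i] == key:
            a[2 * i + 1] = value              # replace if present
        else:
            a[2 * i:2 * i] = array.array('i', [key, value])

    def get(self, key, default=-1):
        a = self._buckets.get(key >> 16)
        if a is None:
            return default
        i = self._find(a, key)
        if 2 * i < len(a) and a[2 * i] == key:
            return a[2 * i + 1]
        return default
```

This sketch implements only the default (overwriting) collision strategy; the "deleting" and "shouting" strategies would need extra bookkeeping for colliding keys.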

Bolo
  • It's also around 10 times slower than a standard dict, when inserting one million random integers using another million random integers as keys. And 20 times as slow on retrieve. It would be interesting to know in what usecase that is OK. :) – Lennart Regebro Dec 29 '10 at 08:21
  • @Lennart Every couple of months I need to analyze a large graph (interwiki links in Wikipedia): ~30M nodes, each being a tuple and identified by a string, probably ~250M links, and rapidly growing. The graph is stored in a PostgreSQL DB, but I wanted significantly faster access, so I've tried to fit the entire graph in RAM. Due to hash collisions, I occasionally have to hit the DB anyway, but that's OK. And the time-memory trade-off that you have mentioned is acceptable here, since I still get my data *much* faster than from a DB. – Bolo Dec 29 '10 at 11:08