
I am using a defaultdict to store millions of phrases, so my data structure looks like mydict['string'] = set(['other', 'strings']). It seems to work OK for smaller sets, but when I hit anything over 10 million keys, my program just crashes with the helpful message Process killed. I know defaultdicts are memory-heavy, but is there an optimised way of storing this with defaultdicts, or would I have to look at other data structures like a numpy array?
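
Roughly, the pattern I'm building looks like this (the names are illustrative):

from collections import defaultdict

# One set of related strings per phrase, built up incrementally.
mydict = defaultdict(set)
mydict['string'].update(['other', 'strings'])
mydict['string'].add('another')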

Thank you

Lezan
  • numpy array instead of defaultdict? Instead of set? I don't see how this would work for the first case, or how you would be better off in the second -- a set is going to be way faster than a numpy array for set-like operations. – hughdbrown Aug 03 '14 at 19:02
  • Whatever memory reduction you obtain will bomb again when you get to 20 million keys (or 30M, etc.). It sure is convenient to keep everything in core, but you'll probably outgrow core. You or your successor will hate you less in the future if you move your storage to a proper DBMS. – msw Aug 03 '14 at 19:08
  • Thank you for the replies; as I sat down to reply, I realised how to deal with this issue. This large dataset was meant to be used as a lookup for a smaller dataset, so I could just reverse the logic. A DBMS would have been a better solution otherwise. – Lezan Aug 03 '14 at 23:43
  • Maybe try a trie (I don't think these are in the standard library, but there are many implementations available)? But only if there is significant overlap between your dictionary keys. Sets probably have overhead similar to a dictionary's -- you might try replacing them with tuples if you have few members. – user1277476 Oct 01 '14 at 23:02
  • Just use a SQLite database. It will save you a lot of pain. – Games Brainiac Oct 10 '14 at 03:32 (a minimal sketch of this appears below)
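
A rough sketch of the DBMS route suggested in the comments above, using SQLite; the file, table, and column names are illustrative, not from the original posts:

import sqlite3

conn = sqlite3.connect('phrases.db')
conn.execute('''CREATE TABLE IF NOT EXISTS phrases
                (key TEXT, member TEXT, PRIMARY KEY (key, member))''')

# Emulate mydict[key].add(member); INSERT OR IGNORE dedupes like a set.
conn.execute('INSERT OR IGNORE INTO phrases VALUES (?, ?)', ('string', 'other'))
conn.execute('INSERT OR IGNORE INTO phrases VALUES (?, ?)', ('string', 'strings'))
conn.commit()

# Emulate mydict[key]: pull the members back out as a set.
members = {m for (m,) in conn.execute(
    'SELECT member FROM phrases WHERE key = ?', ('string',))}
print(members)  # {'other', 'strings'}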

2 Answers


If you're set on staying in memory with a single Python process, then you're going to have to abandon the dict datatype -- as you noted, it has excellent runtime performance characteristics, but it uses a bunch of memory to get you there.

Really, I think @msw's comment and @Udi's answer are spot on -- to scale you ought to look at on-disk or at least out-of-process storage of some sort, probably an RDBMS is the easiest thing to get going.

However, if you're sure that you need to stay in memory and in-process, I'd recommend storing your dataset in a sorted list of (key, value) tuples. Lookups take O(log n) time via bisection, and although insertions and deletions are O(n) (the list has to shift its entries), you avoid the per-key overhead of a hash table. You can wrap the code up for yourself so that the usage looks pretty much like a defaultdict. Something like this might help (only lightly tested via the block at the bottom):

import bisect

class mystore:
    def __init__(self, constructor):
        self.store = []  # sorted list of (key, value) tuples
        self.constructor = constructor

    def __getitem__(self, key):
        i, k = self.lookup(key)
        if k == key:
            return self.store[i][1]
        # Key not present: create, insert, and return a new empty value,
        # mirroring defaultdict's behavior.
        value = self.constructor()
        self.store.insert(i, (key, value))
        return value

    def __setitem__(self, key, value):
        i, k = self.lookup(key)
        if k == key:
            self.store[i] = (key, value)
        else:
            self.store.insert(i, (key, value))

    def lookup(self, key):
        # Probe with a 1-tuple so comparisons never reach the values --
        # sets are only partially ordered, which would confuse bisect.
        i = bisect.bisect_left(self.store, (key,))
        if i < len(self.store):
            return i, self.store[i][0]
        return i, None

if __name__ == '__main__':
    s = mystore(set)
    s['a'] = set(['1'])
    print(s.store)
    s['b']
    print(s.store)
    s['a'] = set(['2'])
    print(s.store)
lmjohns3
  • Thank you for your answer; I ended up using sets and intersection, but this is a valid solution as well. – Lezan Oct 10 '14 at 23:21
  • Interesting, I thought `set`s used `dict` as the underlying data store, so I would have thought you'd run into the same type of problem there. Time for me to RTFM :) – lmjohns3 Oct 11 '14 at 00:42

Maybe try redis' set data type:

Redis Sets are unordered collections of strings. The SADD command adds new elements to a set. It's also possible to do a number of other operations against sets like testing if a given element already exists...

From here: http://redis.io/topics/data-types-intro

redis-py supports these commands.
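
For instance, a minimal sketch with redis-py; it assumes a Redis server running on localhost:6379, and the key naming scheme is illustrative:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# SADD adds members to the set stored at a key (duplicates are ignored).
r.sadd('mydict:string', 'other', 'strings')

# SISMEMBER tests membership without pulling the whole set into Python.
print(r.sismember('mydict:string', 'other'))  # True

# SMEMBERS returns the full set (as bytes by default).
print(r.smembers('mydict:string'))  # {b'other', b'strings'}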

Udi