
I have a Python script that is working with a large (~14 GB) text file. I end up with a dictionary of keys and values, but I am getting a memory error when I try to sort the dictionary by value.

I know the dictionary is too big to load into memory and then sort, but how could I go about accomplishing this?

Joff

1 Answer


You can use an ordered key/value store like wiredtiger, leveldb, or bsddb. All of them keep keys ordered, with custom sort functions if you need them. leveldb is the easiest to use, but if you are on Python 2.7, bsddb is included in the stdlib. If you only require lexicographic ordering of the keys, you can use the btopen function (which opens a BTree database, so keys stay sorted; hashopen uses a hash table and does not preserve order) to open a persistent sorted dictionary:

from bsddb import btopen

# btopen creates a BTree-backed database, so its keys are kept in sorted order
db = btopen('dict.db', 'c')
db['020'] = 'twenty'
db['002'] = 'two'
db['value'] = 'value'
db['key'] = 'key'

print(db.keys())

This outputs

['002', '020', 'key', 'value']

Don't forget to close the db when you are done:

db.close()

Mind that the btopen defaults might not suit your needs. In that case I recommend leveldb, which has a simple API, or wiredtiger for speed.
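As an illustration, here is a minimal leveldb sketch using the third-party plyvel binding (the choice of binding and the file name are my assumptions; the answer only names leveldb). LevelDB keeps keys in sorted byte order, so iterating over the database yields them already sorted:

import plyvel

# open (and create if missing) a LevelDB database on disk
db = plyvel.DB('sorted.ldb', create_if_missing=True)

db.put(b'020', b'twenty')
db.put(b'002', b'two')
db.put(b'key', b'key')

# iteration yields key/value pairs in sorted key order
for key, value in db:
    print key, value

db.close()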

To order by value in bsddb, you have to use the composite key pattern (also called key composition), which boils down to building a database key that keeps the ordering you are looking for. In this example we pack the original dict value first (so that small values appear first), followed by the original dict key (so that the bsddb key is unique):

import struct
from bsddb import btopen

my_dict = {'a': 500, 'abc': 100, 'foobar': 1}

SIZE = struct.calcsize('>Q')  # width of the packed value prefix

# insert (into a fresh, empty database file)
db = btopen('dict.db', 'c')
for key, value in my_dict.iteritems():
    # big-endian fixed-width integers sort lexicographically like numbers
    composite_key = struct.pack('>Q', value) + key
    db[composite_key] = ''  # the stored value is not useful here but is required
db.close()


# read back in value order
db = btopen('dict.db')
for composite_key, _ in db.iteritems():  # BTree iteration is in key order
    # split the composite key back into the packed value and the original key
    packed, key = composite_key[:SIZE], composite_key[SIZE:]
    value = struct.unpack('>Q', packed)[0]
    print key, value
db.close()

This outputs the following:

foobar 1
abc 100
a 500
amirouche
  • Cool, thanks for all the info. I decided to go the sqlite route since I could write regular SQL in Python, which was easier for me to instantly recognize how to go about it. I am sure these solutions are great too. Thanks for pointing me in the right direction – Joff Mar 28 '16 at 16:06
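For reference, a minimal sketch of the sqlite route the asker mentions (the table and column names are illustrative assumptions). SQLite sorts on disk using temporary storage, so an ORDER BY over the values does not require holding the whole dictionary in memory:

import sqlite3

conn = sqlite3.connect('dict.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value INTEGER)')

# insert the key/value pairs (stream them in batches for a 14 GB input)
items = [('a', 500), ('abc', 100), ('foobar', 1)]
conn.executemany('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)', items)
conn.commit()

# an index lets the ORDER BY walk the values instead of sorting each time
conn.execute('CREATE INDEX IF NOT EXISTS kv_value_idx ON kv (value)')

for key, value in conn.execute('SELECT key, value FROM kv ORDER BY value'):
    print key, value

conn.close()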