
I have a parent hashmap data structure with a string as key and hashmap data structures as values (say child1, child2, ..., childN). Each child is a simple key-value map, with a number as key and a string as value. In pseudo-code:

parent['key1'] = child1;    // child1 is a hash map data structure
child1[0] = 'foo';
child1[1] = 'bar';
...

My need is to implement this data structure as a fast lookup table in a database system. Let us take Python as reference language.
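In plain Python, the structure described above might look like this (keys and values are illustrative placeholders):

```python
# a parent dict mapping string keys to child dicts,
# each child mapping int keys to string values
parent = {
    'key1': {0: 'foo', 1: 'bar'},
    'key2': {0: 'baz', 1: 'qux'},
}

# fast child lookup, then value lookup by a known key
child = parent['key1']
print(child[0])  # foo
```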

Requirements for the solution:

  1. retrieving the child hashmaps must be as quick as possible!
  2. the parent hash will have an estimated total size of at most 500 MB

The use case is the following:

  1. A client Python program queries the datastore for a specific child hash
  2. The datastore returns the child hash
  3. The Python program passes the whole hash to a specific function, extracts a specific value from the hash (it already knows which key to use) and passes it to a second function

Would you recommend an in-memory key-value datastore (such as Redis) or a more classical "relational" database solution? Which data model would you suggest?

csparpa

3 Answers


Absolutely go with Redis. Not only is it really fast, it also handles exactly the structure you need: http://redis.io/commands#hash

In your case, you could avoid reading the whole 'child hash', since the client "extracts a specific value from the hash (it already knows which key to use)"

redis> HMSET myhash field1 "Hello" field2 "World"
OK
redis> HGET myhash field1
"Hello"
redis> HGET myhash field2
"World"

or, if you do want the whole hash:

redis> HGETALL myhash
1) "field1"
2) "Hello"
3) "field2"
4) "World"
redis>

Of course, using a client library gives you the result as a workable object right away; in your case, a Python dictionary.

Javier
  • Thank you for your answer. I poorly described step 3 of my use case (now it's fixed), in fact I want the _whole_ child hash to be read and handled by my Python code. Does Redis still suit? – csparpa Oct 10 '12 at 20:35
  • Redis will still be good, but it would be faster to save the child hash as a serialized/pickled string, and deserialize/unpickle it on the client after reading. Also make sure to get `hiredis` installed for your respective client (i.e. http://pypi.python.org/pypi/hiredis) which will enable faster data conversion from redis to python and vice versa. – Nisan.H Oct 10 '12 at 23:52
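The serialize-then-store approach suggested in the comment above can be illustrated with a minimal pickle round-trip; this is only a sketch of the serialization step, with the Redis SET/GET elided, so no server is assumed:

```python
import pickle

# a sample child hash, as in the question
child1 = {0: 'foo', 1: 'bar'}

# serialize the whole child hash to a single string before storing it in Redis
blob = pickle.dumps(child1)
# ...store 'blob' under the parent key in Redis, read it back later...

# deserialize on the client after reading
restored = pickle.loads(blob)
print(restored == child1)  # True
```

Note that, unlike JSON, pickle preserves the int keys, so no key conversion is needed after loading.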

Sample code using redis-py, assuming you already have Redis (and ideally hiredis) installed, saving each parent as a Redis hash with the children as serialized string fields, and handling serialization and deserialization on the client side:

JSON version:

## JSON version
import json 
# you could use pickle instead, 
# just replace json.dumps/json.loads with pickle/unpickle

import redis

# set up the redis client
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# sample parent dicts
parent0 = {'child0': {0:'a', 1:'b', 2:'c',}, 'child1':{5:'e', 6:'f', 7:'g'}}
parent1 = {'child0': {0:'h', 1:'i', 2:'j',}, 'child1':{5:'k', 6:'l', 7:'m'}}

# save the parents as hashfields, with the children as serialized strings
# bear in mind that JSON will convert the int keys to strings in the dumps() process
r.hmset('parent0', {key: json.dumps(parent0[key]) for key in parent0})
r.hmset('parent1', {key: json.dumps(parent1[key]) for key in parent1})


# Get a child dict from a parent
# say child1 of parent0
childstring = r.hget('parent0', 'child1') 
childdict = json.loads(childstring) 
# this could have been done in a single line... 

# if you want to convert the keys back to ints:
childdict = {int(key): value for key, value in childdict.items()}

print childdict

pickle version:

## pickle version
# For pickle, you need a file-like object. 
# StringIO is the pure-Python implementation, while cStringIO
# is the C implementation of the same interface;
# cStringIO is faster.
# see http://docs.python.org/library/stringio.html and
# http://www.doughellmann.com/PyMOTW/StringIO/ for more information
import pickle
# Find the best implementation available on this platform
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

import redis

# set up the redis client
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# sample parent dicts
parent0 = {'child0': {0:'a', 1:'b', 2:'c',}, 'child1':{5:'e', 6:'f', 7:'g'}}
parent1 = {'child0': {0:'h', 1:'i', 2:'j',}, 'child1':{5:'k', 6:'l', 7:'m'}}

# define a class with a reusable StringIO object
class Pickler(object):
    """Simple helper class to use pickle with a reusable string buffer object"""
    def __init__(self):
        self.tmpstr = StringIO()

    def __del__(self):
        # close the StringIO buffer and delete it
        self.tmpstr.close()
        del self.tmpstr

    def dump(self, obj):
        """Pickle an object and return the pickled string"""
        # empty current buffer
        self.tmpstr.seek(0,0)
        self.tmpstr.truncate(0)
        # pickle obj into the buffer
        pickle.dump(obj, self.tmpstr)
        # move the buffer pointer to the start
        self.tmpstr.seek(0,0)
        # return the pickled buffer as a string
        return self.tmpstr.read()

    def load(self, pickled):
        """Load a pickled object string and return the object"""
        # empty the current buffer
        self.tmpstr.seek(0,0)
        self.tmpstr.truncate(0)
        # write the pickled string into the buffer
        self.tmpstr.write(pickled)
        # move the buffer pointer to the start
        self.tmpstr.seek(0,0)
        # unpickle the buffer contents into an object
        return pickle.load(self.tmpstr)


pickler = Pickler()

# save the parents as hashfields, with the children as pickled strings, 
# pickled using our helper class
r.hmset('parent0', {key: pickler.dump(parent0[key]) for key in parent0})
r.hmset('parent1', {key: pickler.dump(parent1[key]) for key in parent1})


# Get a child dict from a parent
# say child1 of parent0
childstring = r.hget('parent0', 'child1') 
# this could be done in a single line... 
childdict = pickler.load(childstring) 

# we don't need to do any str to int conversion on the keys.

print childdict
Nisan.H
  • the 'keys as strings' is a JSON limitation, not Redis. messagePack seems to allow any type as key, and pickle would of course handle any pickleable Python object. – Javier Oct 11 '12 at 16:39
  • @Javier thanks for the correction, I updated the code to reflect this. – Nisan.H Oct 11 '12 at 17:46

After a quick search based on Javier's hint, I came up with this solution: I could implement a single parent hash in Redis, where the value fields would be the string representations of the child hashes. This way I can quickly read them and evaluate them from the Python program.

Just to give an example, my Redis data structure will look similar to:

//write a hash with N key-value pairs: each value is an M key-value pairs hash
redis> HMSET parent_key1 child_hash "c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM"
  OK
redis> HMSET parent_key2 child_hash "c2k1:c2v1, c2k2:c2v2, [...], c2kM:c2vM"
  OK
[...]
redis> HMSET parent_keyN child_hash "cNk1:cNv1, cNk2:cNv2, [...], cNkM:cNvM"
  OK

//read data
redis> HGET parent_key1 child_hash
  "c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM"

Then my Python code just needs to use the Redis bindings to query for the desired child hashes, which are returned as their string representations; what is left to do is turn those strings back into the corresponding dictionaries, which can then be looked up at convenience.

Example code (as suggested in this answer):

>>> import ast
>>> # Redis query:
>>> #   1. Set up the Redis bindings
>>> #   2. Ask for the value at key: parent_key1
>>> #   3. Store the value in the string 's'
>>> d = ast.literal_eval('{' + s + '}')
>>> d
{c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM}

Hope I'm not missing anything!
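One caveat worth noting: ast.literal_eval only accepts valid Python literal syntax, so the stored string must quote its string keys and values. One way to guarantee that is to store repr() of the child dict; a self-contained sketch of the round-trip, with the Redis store/read step elided:

```python
import ast

# a sample child hash, as in the question
child1 = {0: 'foo', 1: 'bar'}

# repr() produces a valid Python dict literal, e.g. "{0: 'foo', 1: 'bar'}"
serialized = repr(child1)

# ...store 'serialized' as the field value in Redis, HGET it back later...

# literal_eval safely parses the literal back into a dict
restored = ast.literal_eval(serialized)
print(restored[0])  # foo
```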

csparpa
  • well, you can use this object marshalling to store strings, but if you want speed, you should store each Python hash on a Redis hash. The usual idiom when you need more than a single keyspace is to concatenate keys: `HMSET "parentkey1:childkeyX" f1 v1 f2 v2 f3 v3`. This allows Redis to optimize small hashes with ziplists (more compact and faster for less than a hundred or so fields) – Javier Oct 10 '12 at 23:36
  • or, if you really want to store your objects as strings, you can use pickle, since Redis strings are 8-bit clean. but if you want to do some server-side processing, then consider JSON or messagePack, as both can be decoded by the embedded Lua engine. of course, if you don't string-marshall your objects (as described on the previous comment), it's even easier. – Javier Oct 10 '12 at 23:39
  • Ditto what @Javier said. The fastest retrieval time would be achieved by saving and reading the child hashes as serialized or pickled strings. The internal overhead with a large number of individual key:value pair access times on Redis can grow quite large eventually (120,000 keys in a list or across several hashfields will take ~400ms, vs. ~<10ms for the serialized string of the same data.) For smaller sets it's probably faster to store the data as hashfields. If you could just get the child's required members, that would be the best. – Nisan.H Oct 10 '12 at 23:45
  • I was already thinking to store the children hashes in a serialized form, but I had doubts that the serialization/deserialization time would be greater than the retrieval time that Redis could grant me...but as my parent hash will store about 500000 keys, Redis performances will quickly degrade and so serialization becomes quite a must. I'll then go with redis-py+hiredis and serialize/deserialize with json.dumps/json.loads -> thank you both for your precious help! – csparpa Oct 11 '12 at 08:56
  • In that case you might want to use the pickle version to speed things up a bit. Although the JSON version does have the benefit of still being readable when stored in Redis. I updated my answer to also have a pickle sample. – Nisan.H Oct 11 '12 at 18:33
  • @claudiosparpaglione i don't get why you want your own 'parent hash' instead of using the main Redis space. Just add any prefix to your object id to create the full key, and store a single object on each Redis hash object. As said before, this allows Redis to apply quite effective small-object optimizations (and to retrieve single fields when appropriate) – Javier Oct 11 '12 at 18:35
  • Thanks for the update @Nisan.H! Javier, I have about 500000 hashes to store: in your opinion, using the main Redis space (and thus saving 500000 plain hashes as values each one having a composite (prexifed) key) can lead to good lookup performance if compared to the serialization scenario? Any suggestion is welcome, as I don't have any empirical proof (I'm not going to implement both the cases just for benchmarking purposes). Thanks in advance! – csparpa Oct 11 '12 at 21:00
  • @claudiosparpaglione you're welcome. As for the question: Redis already implements a key:value type hash map lookup mechanism for its internal data storage. So if you only need to retrieve _small_ amounts of data each time, having it as close to the top level as possible will reduce the overall time spent _retrieving_ data, thus improving performance. If you need to get large data from Redis, then you would want to optimize for minimal key:value access operations (e.g. stream more data over fewer key accesses, to reduce lookup and access overhead.) – Nisan.H Oct 11 '12 at 21:05
  • Specifically, you could save the children objects as top-level hashes using a concatenated `parent.child` string as the key, and the child's content as the hash field members: `'parent0.child3': {key1: value1, key2: value2, ...}`, and then retrieve either the whole child using `hgetall <key>`, or just the required members using `hmget <key> [<field1>, <field2>, ...]`, where in both cases the key would be something like `parent3.child5`. – Nisan.H Oct 11 '12 at 21:07
  • @claudiosparpaglione: yes. several million keys on the main Redis space is no problem at all, and MUCH better than storing serialized hashes. when Redis detects that a given object (a hash in your case) is small enough, it will apply several optimizations that save _lots_ of space without sacrificing performance, especially for the `HGETALL "parent:child"` command. That's why it's so common to simply concatenate keys (typically separated by ':') to create a single key for Redis. – Javier Oct 18 '12 at 15:56
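The concatenated-key data model recommended in the comments above can be sketched as follows; a plain dict stands in for the Redis keyspace here, since no server is assumed, and each value below would in practice be a small Redis hash written with e.g. `HMSET "parent0:child1" 5 e 6 f 7 g`:

```python
# plain-dict stand-in for the top-level Redis keyspace:
# one small hash per child, under a concatenated 'parent:child' key
store = {
    'parent0:child0': {0: 'a', 1: 'b', 2: 'c'},
    'parent0:child1': {5: 'e', 6: 'f', 7: 'g'},
}

def get_child(parent, child):
    """Look up a child hash by its concatenated 'parent:child' key."""
    return store['%s:%s' % (parent, child)]

print(get_child('parent0', 'child1')[5])  # e
```

With this layout each child stays small enough for Redis's small-hash optimization, and a single HGETALL (or HGET for one field) retrieves it.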