
Each of my mappers needs access to a very large dictionary. Is there some way I can avoid the overhead of each mapper opening its own copy, and instead have all of them point to one global shared object?

Any suggestions specific to DISCO, or to the MapReduce paradigm in general, would be helpful.

rupen
  • I saw someone use `global my_dict` with `if 'my_dict' not in globals(): my_dict = load_dict()`, but I am not sure if this will actually work; I need to test it (see the sketch below these comments). – rupen Apr 17 '14 at 02:00
  • I am thinking... maybe discodb might be what I am looking for. Its documentation says: (a) "In contrast to Python's builtin dict object, DiscoDB can handle tens of millions of key-value pairs without consuming gigabytes of memory." (b) "The benefit of this is that after they have been persisted, instantiating them from disk and key lookups are lightning-fast operations, thanks to memory mapping." Any thoughts? – rupen Apr 17 '14 at 04:06
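For reference, a minimal sketch of the per-process caching idea from the first comment. It assumes a Disco-style map function taking `(entry, params)`, and `load_dict()` is a hypothetical placeholder for whatever builds the large dictionary; each worker process loads the dictionary once and reuses it on later calls, but separate processes still hold separate copies.

```python
# Sketch only: lazy-load the dictionary into module globals once per worker process.
# load_dict() is a placeholder for the real loading routine.
def map_fn(entry, params):
    global my_dict
    if 'my_dict' not in globals():   # first call in this process: load the dictionary
        my_dict = load_dict()        # later calls in the same process reuse it
    key = entry.strip()
    yield key, my_dict.get(key)
```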

1 Answer


Use the Redis key-value store.

It can be installed quickly on Linux, and compiled versions are also available for Windows.

The Python redis package then lets you write, read, and update values very easily.

The hash data type will serve you best: you can add or edit values under so-called fields (keys, in Python dictionary terminology), and it is both fast and straightforward.
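A minimal sketch, assuming a Redis server reachable at localhost:6379 and the redis Python package; the hash name `my_dict` and the sample values are made up for illustration.

```python
import redis

# Connect to a Redis server (adjust host/port for your setup).
r = redis.Redis(host='localhost', port=6379, db=0)

# Load the big dictionary into a single Redis hash once, up front.
big_dict = {'apple': '3', 'banana': '7'}   # stand-in for the real data
for k, v in big_dict.items():
    r.hset('my_dict', k, v)

# Every mapper process can then look up individual fields without
# loading the whole dictionary into its own memory.
value = r.hget('my_dict', 'apple')         # returned as bytes, e.g. b'3'
print(value)
```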

This solution works even for independent processes. You can also share the data in Redis over the network, which makes it a great option for a map/reduce scenario.

The only thing you have to take care of when storing and restoring values is that values can only be strings, so you have to serialize and deserialize them. json.dumps and json.loads work very well for this.
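For example (again assuming a local Redis server; the field name and record below are made up):

```python
import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Serialize structured values to a string before storing them...
r.hset('my_dict', 'user:42', json.dumps({'name': 'Ada', 'scores': [1, 2, 3]}))

# ...and deserialize after reading them back.
raw = r.hget('my_dict', 'user:42')
record = json.loads(raw.decode('utf-8'))
print(record['scores'])
```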

Jan Vlcinsky