
I often hit problems where I want to do something simple over a set of many, many objects quickly. My natural choice is IPython Parallel for its simplicity, but often I have to deal with unpicklable objects. After trying for a few hours I usually resign myself to running my task overnight on a single computer, or do something clumsy like dividing the work semi-manually across multiple Python scripts.

To give a concrete example, suppose I want to delete all keys in a given S3 bucket.

What I'd normally do without thinking is:

import boto
from IPython.parallel import Client

connection = boto.connect_s3(awskey, awssec)
bucket = connection.get_bucket('mybucket')

client = Client()
loadbalancer = client.load_balanced_view()

keyList = list(bucket.list())
loadbalancer.map(lambda key: key.delete(), keyList)

The problem is that boto's Key object is unpicklable (*). This happens very often for me, in different contexts. It's also a problem with multiprocessing, execnet, and every other framework and library I've tried (for an obvious reason: they all use the same pickler to serialize objects).

Do you guys also have these problems? Is there a way to serialize these more complex objects? Do I have to write my own pickler for these particular objects? If so, how do I tell IPython Parallel to use it? How do I write a pickler?

Thanks!


(*) I'm aware that I can simply make a list of the keys names and do something like this:

loadbalancer.map(lambda keyname: getKey(keyname).delete(), keyNames)

and define the getKey function on each engine of the IPython cluster. This is just a particular instance of a more general problem that I run into often. Maybe it's a bad example, since it can easily be solved another way.
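For reference, the "send names, not objects" pattern can be sketched with a toy stand-in for the bucket (the Key class, getKey helper, and registry below are illustrative, not boto's API; only plain strings would cross the wire to the engines):

```python
# Toy sketch of the workaround: map over key *names* (always picklable)
# and look the unpicklable object up again on the worker side.
class Key:
    """Stand-in for boto's Key: holds a reference to a live store."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def delete(self):
        del self.store[self.name]

registry = {'a.txt': None, 'b.txt': None}  # stands in for the S3 bucket

def getKey(name):
    # In the real setup this would reconnect to S3 on the engine.
    return Key(name, registry)

key_names = list(registry)        # picklable list of strings
for name in key_names:            # loadbalancer.map(...) in IPython
    getKey(name).delete()

print(registry)  # {}
```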

Rafael S. Calsaverini

2 Answers


IPython has a use_dill option: if you have the dill serializer installed, you can serialize most "unpicklable" objects.

How can I use dill instead of pickle with load_balanced_view

Mike McKerns

That IPython sure brings people together ;). From what I've been able to gather, the problem with pickling these objects is their methods. So instead of using the key's own method to delete it, you could write a function that takes a key and deletes it. Maybe first get a list of dicts with the relevant information on each key, and afterwards call a function delete_key(dict), which I leave up to you to write because I've no idea how to handle S3 keys.

Would that work?
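A sketch of that idea, with a plain dict standing in for the S3 bucket (delete_key and the toy store are mine, not boto's API; a real delete_key would reconnect to S3 on the engine and delete by name):

```python
# Ship plain dicts (always picklable) instead of Key objects.
store = {'logs/a.gz': b'...', 'logs/b.gz': b'...'}  # stand-in for the bucket

def delete_key(info):
    # The real version would reconnect to S3 here and delete info['name'].
    store.pop(info['name'])

key_dicts = [{'name': name} for name in list(store)]

for info in key_dicts:       # loadbalancer.map(delete_key, key_dicts)
    delete_key(info)

print(store)  # {}
```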


Alternatively, this might work: instead of calling the method on the instance, call the method on the class with the instance as an argument. So instead of lambda key : key.delete() you would do lambda key : Key.delete(key). Of course you then have to push the class to the engines, but that shouldn't be a problem. A minimal example:

 class stuff(object):
     def __init__(self, a=1):
         self.list = []
     def append(self, a):
         self.list.append(a)

 import IPython.parallel as p
 c = p.Client()
 dview = c[:]

 li = map(stuff, [[]]*10)  # creates 10 stuff instances

 dview.map(lambda x: x.append(1), li)  # should append 1 to all lists, but fails

 dview.push({'stuff': stuff})  # push the class to the engines
 dview.map(lambda x: stuff.append(x, 1), li)  # this works.
Alex S