I often hit problems where I want to do something simple over a set of many, many objects quickly. My natural choice is IPython Parallel for its simplicity, but I often have to deal with unpicklable objects. After trying for a few hours I usually resign myself to running my task overnight on a single computer, or to doing something clumsy like semi-manually splitting the work across multiple Python scripts.
To give a concrete example, suppose I want to delete all keys in a given S3 bucket.
What I'd normally do without thinking is:
import boto
from IPython.parallel import Client

connection = boto.connect_s3(awskey, awssec)
bucket = connection.get_bucket('mybucket')

client = Client()
loadbalancer = client.load_balanced_view()

keyList = list(bucket.list())
# Each Key object has to be pickled to be shipped to an engine.
loadbalancer.map(lambda key: key.delete(), keyList)
The problem is that the Key object in boto is unpicklable (*). This comes up very often for me, in different contexts. It's a problem with multiprocessing, execnet, and every other framework and library I've tried as well (for obvious reasons: they all use the same pickler to serialize objects).
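A quick way to confirm this, independent of IPython, is to round-trip one of the objects through pickle by hand (a minimal sketch, reusing connection and bucket from the snippet above):

import pickle

# Grab any one key and try to pickle it directly.
key = next(iter(bucket.list()))
pickle.dumps(key)  # raises an exception -- Key objects aren't picklable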
Do you guys also run into these problems? Is there a way to serialize these more complex objects? Do I have to write my own pickler for these particular objects? If so, how do I tell IPython Parallel to use it? And how do I write a pickler in the first place?
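For what it's worth, the closest thing I've found so far is registering a reduce function with copy_reg, so that pickle serializes a Key as just (bucket name, key name) and rebuilds it on the other side. This is only a sketch under the assumption that reconnecting per key is acceptable; rebuildKey and reduceKey are names I made up:

import copy_reg

import boto
from boto.s3.key import Key

def rebuildKey(bucketName, keyName):
    # Runs wherever the object is unpickled: reconnect and look the key up again.
    connection = boto.connect_s3(awskey, awssec)
    return connection.get_bucket(bucketName).get_key(keyName)

def reduceKey(key):
    # Serialize a Key as just the two names needed to rebuild it.
    return rebuildKey, (key.bucket.name, key.name)

copy_reg.pickle(Key, reduceKey)

As far as I understand, this only helps if IPython's serializer goes through the standard pickle machinery, and pickle stores rebuildKey by name, so the engines need rebuildKey (and the credentials) importable on their side too, e.g. from a module both sides share. I've also read that newer IPython versions can swap the serializer entirely (e.g. client[:].use_dill()), though I don't know whether dill copes with boto's objects.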
Thanks!
(*) I'm aware that I could simply make a list of the key names and do something like this:
loadbalancer.map(lambda keyname: getKey(keyname).delete(), keyNames)
and define the getKey function on each engine of the IPython cluster (spelled out below). But this is just a particular instance of a more general problem that I run into often. Maybe it's a bad example, since it can easily be solved in another way.
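Spelled out, that workaround might look like this (a sketch under the same assumptions as above; deleteByName is a name I made up, and it folds the getKey lookup and the delete into one function so that only strings ever cross the wire):

import boto
from IPython.parallel import Client

client = Client()
loadbalancer = client.load_balanced_view()

# Push the credentials so each engine can open its own connection.
client[:]['awskey'] = awskey
client[:]['awssec'] = awssec

def deleteByName(keyname):
    # Runs on an engine: only the keyname string gets pickled.
    import boto
    connection = boto.connect_s3(awskey, awssec)
    connection.get_bucket('mybucket').get_key(keyname).delete()

connection = boto.connect_s3(awskey, awssec)
keyNames = [key.name for key in connection.get_bucket('mybucket').list()]
loadbalancer.map(deleteByName, keyNames)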