Consuming the pickle binary format from non-Python (with Celery and RabbitMQ)

Question

I'm using Python, Celery and RabbitMQ to produce messages from loosely coupled systems. However, I'm worried about interoperability.

When I inspect the payload of a Celery-produced message directly in RabbitMQ, I get the following binary format:

[Image: binary version of Celery output]

I strongly suspect that this is a binary pickle format. However, I'm having trouble finding information on the binary pickle format in general.

So, I really have a few questions:

1. Is this a binary pickle format?
2. What resources are available to map out the binary format?
3. Given that Celery does, in fact, produce pickled data, what options are available to me if I want to consume those messages from non-Python consumers (such as C++ or PHP)?
4. Do you have any experience working with Celery and RabbitMQ while interoperating with consumers that are not written in Python? Do you have any advice on that subject?

Thanks in advance...

UPDATE:

Based on Brendan's recommendation, I've switched this to a JSON serializer with:

add.apply_async(args=[10, 10], serializer="json")

For reference for future searchers, it appears that the JSON format, in this specific, empty case, is about 15% larger (or 28 bytes):

[Image: JSON serialized version from Celery]

Also, for people who might be interested in reading the pickle format from C++, I found this question helpful: How can I read a python pickle database/file from C?

UPDATE 2:

Based on Asksol's recommendation, I tried out the zlib compression with:

async_result = add.apply_async((x, y), compression='zlib')

I thought there were some interesting results, so here they are:

[Image: format comparison table]

As you can see in this example, the Pickle format is smaller than JSON. However, when compression is added to the mix, compressed JSON is actually smaller than either version of Pickle. I'm also curious about the parse times of the two formats. JSON was designed to be fast to parse, while my understanding is that Pickle is based on offsets, which would mean it doesn't have to be iterated through. I wonder if anyone has benchmarked the two formats, with and without compression, taking parsing CPU time into account.
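For anyone who wants to repeat this kind of comparison outside Celery, here is a minimal standard-library sketch. The `message` dict is a hypothetical payload, not Celery's actual wire format, and pickle protocol 2 is an assumption, so the numbers it prints will differ from the table above and from machine to machine.

```python
import json
import pickle
import timeit
import zlib

# Hypothetical payload for illustration; not Celery's actual message layout.
message = {"task": "tasks.add", "args": [10, 10], "kwargs": {}}


def encode(obj):
    """Serialize obj with pickle (protocol 2, an assumption) and with JSON."""
    return {
        "pickle": pickle.dumps(obj, protocol=2),
        "json": json.dumps(obj).encode("utf-8"),
    }


def sizes(obj):
    """Return byte counts for each format, with and without zlib."""
    result = {}
    for name, data in encode(obj).items():
        result[name] = len(data)
        result[name + "+zlib"] = len(zlib.compress(data))
    return result


def parse_seconds(obj, repeats=10000):
    """Time `repeats` parses of each format (decompression not included)."""
    encoded = encode(obj)
    return {
        "pickle": timeit.timeit(
            lambda: pickle.loads(encoded["pickle"]), number=repeats),
        "json": timeit.timeit(
            lambda: json.loads(encoded["json"].decode("utf-8")), number=repeats),
    }


if __name__ == "__main__":
    for name, size in sizes(message).items():
        print(f"{name:12s} {size:5d} bytes")
    for name, seconds in parse_seconds(message).items():
        print(f"{name:12s} {seconds:.4f} s")
```

Very small payloads can grow under zlib because of the fixed header overhead, so it is worth running this against messages that resemble your real traffic.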

simplejson is pretty fast; as far as I recall, it wasn't much faster than pickle. The yajl and cjson libs are faster but are broken in a number of places (e.g. yajl can't handle float timestamps). – asksol, Aug 31 '12 at 09:09
By the way, you could also bring msgpack into this; I'm not sure how it performs. – asksol, Aug 31 '12 at 09:10

score 5 · Accepted Answer · answered Aug 29 '12 at 18:47

According to the documentation, you can make Celery use JSON instead. I'd recommend doing that since it's pretty standard, no matter what language you use. If you use a lot of binary data, it might increase the size of the messages though.

Data transferred between clients and workers needs to be serialized. The default serializer is pickle, but you can change this globally or for each individual task. There is built-in support for pickle, JSON, YAML and msgpack, and you can also add your own custom serializers by registering them into the Kombu serializer registry (see Kombu: Serialization of Data).
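To illustrate why JSON is easy to consume elsewhere, here is a rough sketch of what a consumer has to do once it holds the message body. It is written in Python for brevity, but the same steps apply in C++ or PHP with any JSON library. The body and its field names (`task`, `id`, `args`, `kwargs`) are assumptions for illustration; inspect a real message from your broker before relying on them.

```python
import json

# Hypothetical message body; the field names are an assumption and should be
# checked against what your Celery version actually publishes.
raw_body = (
    b'{"task": "tasks.add", "id": "example-id", '
    b'"args": [10, 10], "kwargs": {}}'
)


def decode_task(body):
    """Decode a UTF-8 JSON body into (task name, args, kwargs)."""
    message = json.loads(body.decode("utf-8"))
    return message["task"], message.get("args", []), message.get("kwargs", {})


task, args, kwargs = decode_task(raw_body)
print(task, args, kwargs)
```

Nothing here depends on Python-specific types, which is the main interoperability advantage over pickle.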

Brendan Long

You can also enable compression: http://docs.celeryproject.org/en/latest/userguide/calling.html#calling-compression – asksol Aug 30 '12 at 10:50
Thanks @asksol, I've added compression to the examples and posted a comparison chart (with some interesting results). And thank you, too, for writing celery. Cheers. – Homer6 Aug 30 '12 at 21:15

score 2 · Answer 2 · answered Aug 29 '12 at 18:51

1. From the example in the pickletools module, I infer that this is indeed a pickle stream.
2. The format is not exactly documented, and there are in fact several versions of it. But you can use the pickletools script (see above) to analyze pickle files.
3. You cannot consume pickled data from other languages. The format is highly Python-specific and in fact executes Python code (at the very least, object construction).
4. I have not. It appears Brendan Long has found a solution. You'll still need some dedicated code to parse the JSON messages at the other end (especially if you need to transfer any complicated structures), but it shouldn't be too hard (though possibly fragile).
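As a small sketch of the pickletools approach mentioned above: `pickletools.dis` prints a disassembly of a pickle stream and `pickletools.genops` iterates over its opcodes, both without unpickling the data. The payload below is a hypothetical stand-in, and protocol 2 is an assumption; substitute the raw bytes you pulled from RabbitMQ.

```python
import io
import pickle
import pickletools

# Hypothetical payload; replace with the raw message body from the broker.
payload = pickle.dumps({"args": [10, 10]}, protocol=2)


def opcode_names(data):
    """List the opcode names in a pickle stream without unpickling it."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


def disassemble(data):
    """Return the human-readable disassembly that pickletools.dis prints."""
    out = io.StringIO()
    pickletools.dis(data, out=out)
    return out.getvalue()


if __name__ == "__main__":
    print(disassemble(payload))
```

If the bytes are not a valid pickle, these functions raise an exception, which is itself a quick way to answer the first question.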

For simple data structures you can certainly pick out the key-value pairs ... but for much more than that, you are absolutely right. – Joran Beasley, Aug 29 '12 at 18:58
This article looks good too: http://stackoverflow.com/questions/1296162/how-can-i-read-a-python-pickle-database-file-from-c – Homer6, Aug 29 '12 at 20:40

score 0 · Answer 3 · answered Aug 29 '12 at 18:52

By default Celery uses pickle to serialize messages.

http://celery.github.com/celery/userguide/calling.html#calling-serializers

You can change the serializer to JSON or YAML if you want to use a text-based serialization.

mher

Thank you for your contribution – Homer6 Aug 29 '12 at 19:08
