
I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.

I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.

As I understand it, python-rq uses pickle to serialise the function to be executed, including its parameters, and adds this along with other values to a Redis hash.

Since the parameters contain the data to be saved to the database, they are quite large (~50MB), and when this is serialised and saved to Redis it not only takes a noticeable amount of time but also consumes a large amount of memory. Redis plans on Heroku cost $30 p/m for just 100MB. In fact, I very often get OOM errors like:

OOM command not allowed when used memory > 'maxmemory'.
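To illustrate, the enqueue call currently looks roughly like this (function and variable names are simplified placeholders); the whole list is passed as an argument, so RQ pickles all ~50MB of it into the Redis job hash:

```
# Simplified illustration -- the real function and variable names differ.
from redis import Redis
from rq import Queue

from myapp.tasks import load_documents  # hypothetical task module

q = Queue(connection=Redis())

# `documents` is the ~50MB list of key-value pairs destined for MongoDB.
# Passing it directly means RQ pickles the whole list into the Redis job hash.
q.enqueue(load_documents, documents)
```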

I have two questions:

  1. Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
  2. Is there a way to pass a reference to the data rather than serialising the data itself?

Your thoughts on the best solution are much appreciated!

WillJones
  • Redis/RabbitMQ is a communication layer between two processes. The reason many task queues use pickle is that they need to pass whole objects, and pickle is good at this. A potential solution is to serialize your data to a JSON file and just pass your worker the location of the file, so Redis won't need to store these huge pickles in memory. I haven't done this personally so I thought a comment would be better, but my intuition says as long as you can serialize it, it should work. – arlyon Jan 04 '17 at 10:50
  • If you are able to serialize and save it to a JSON file then python-rq should be fine. – arlyon Jan 04 '17 at 10:54
  • Hey @arlyon - I think python-rq can only serialise via pickle. But yes, saving to /tmp in the ephemeral file system and passing the location could be a good solution. Thanks for the brainstorm! – WillJones Jan 04 '17 at 12:53
  • Can you show me what sort of task calls you're making currently? EG: What data are you passing into your task? – rdegges Jan 04 '17 at 14:12
  • @rdegges thanks for the reply. The input is a large list of key-value pairs, ready to be inserted into MongoDB via `.insert_many()`. – WillJones Jan 04 '17 at 17:05

2 Answers


Since you mentioned in your comment that your task input is a large list of key/value pairs, I'm going to recommend the following (a rough code sketch follows the list):

  • Load up your list of key/value pairs in a file.
  • Upload the file to Amazon S3.
  • Get the resulting file URL, and pass that into your RQ task.
  • In your worker task, download the file.
  • Parse the file line-by-line, inserting the documents into Mongo.
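For illustration, one rough way to wire this up with boto3, pymongo and RQ might look like the sketch below. The bucket, key, database and collection names are placeholders, and it passes the S3 key rather than a full URL:

```
# Rough sketch -- bucket, key and connection details are placeholders.
import gzip
import json

import boto3
from pymongo import MongoClient
from redis import Redis
from rq import Queue

BUCKET = "my-task-payloads"  # placeholder bucket name


def enqueue_load(documents):
    """Write the documents to a gzipped JSON-lines file, push it to S3,
    and enqueue only the S3 key (a few bytes) instead of the data itself."""
    path = "/tmp/payload.jsonl.gz"
    with gzip.open(path, "wt") as f:
        for doc in documents:
            f.write(json.dumps(doc) + "\n")

    key = "task-payloads/payload.jsonl.gz"
    boto3.client("s3").upload_file(path, BUCKET, key)

    Queue(connection=Redis()).enqueue(load_documents, key)


def load_documents(key, batch_size=1000):
    """RQ worker task: stream the file from S3 and insert in batches."""
    path = "/tmp/payload.jsonl.gz"
    boto3.client("s3").download_file(BUCKET, key, path)

    collection = MongoClient()["mydb"]["mycollection"]  # placeholder names
    batch = []
    with gzip.open(path, "rt") as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                collection.insert_many(batch)
                batch = []
    if batch:
        collection.insert_many(batch)
```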

Using the method above, you'll be able to:

  • Quickly break up your tasks into manageable chunks.
  • Upload these small, compressed files to S3 quickly (use gzip).
  • Greatly reduce your redis usage by requiring much less data to be passed over the wires.
  • Configure S3 to automatically delete your files after a certain amount of time (there are S3 lifecycle settings for this: you can have files deleted automatically after 1 day, for instance; see the sketch after this list).
  • Greatly reduce memory consumption on your worker by processing the file one line at-a-time.
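For the automatic-deletion point, an S3 lifecycle rule can expire the uploaded payloads after a day. A minimal sketch with boto3, again assuming a placeholder bucket name and key prefix:

```
import boto3

# Expire anything under the task-payloads/ prefix one day after upload.
# Bucket name and prefix are placeholders.
boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-task-payloads",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-task-payloads",
                "Filter": {"Prefix": "task-payloads/"},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)
```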

For use cases like what you're doing, this will be MUCH faster and require much less overhead than sending these items through your queueing system.

Hope this helps!

rdegges

It turns out that the solution that worked for me was to save the data to Amazon S3 storage, and then pass the URI to the function in the background task.

WillJones