
I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.

I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.

As I understand it, python-rq uses pickle to serialise the function to be executed, including its parameters, and adds this along with other values to a Redis hash.

Since the parameters contain the data to be saved to the database, they are quite large (~50MB), and when this is serialised and saved to Redis it not only takes a noticeable amount of time but also consumes a large amount of memory. Redis plans on Heroku cost $30 p/m for just 100MB. In fact, I very often get OOM errors like:

OOM command not allowed when used memory > 'maxmemory'.
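To illustrate, the enqueue call currently looks roughly like this (function and variable names are simplified placeholders); the whole list is passed as an argument, so RQ pickles all ~50MB of it into the Redis job hash:

```
# Simplified illustration -- the real function and variable names differ.
from redis import Redis
from rq import Queue

from myapp.tasks import load_documents  # hypothetical task module

q = Queue(connection=Redis())

# `documents` is the ~50MB list of key-value pairs destined for MongoDB.
# Passing it directly means RQ pickles the whole list into the Redis job hash.
q.enqueue(load_documents, documents)
```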

I have two questions:

  1. Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
  2. Is there a way to pass a reference to the data rather than serialising the data itself?

Your thoughts on the best solution are much appreciated!

WillJones
  • Redis/RabbitMQ is a communication layer between two processes. The reason many task queues use pickle is that they need to pass whole objects, and pickle is good at this. A potential solution is to serialize your data to a JSON file and just pass your worker the location of the file, so Redis won't need to store these huge pickles in memory. I haven't done this personally so I thought a comment would be better, but my intuition says as long as you can serialize it, it should work. – arlyon Jan 04 '17 at 10:50
  • If you are able to serialize and save it to a JSON file then python-rq should be fine. – arlyon Jan 04 '17 at 10:54
  • Hey @arlyon - I think python-rq can only serialise via pickle. But yes, saving to /tmp in the ephemeral file system and passing the location could be a good solution. Thanks for the brainstorm! – WillJones Jan 04 '17 at 12:53
  • Can you show me what sort of task calls you're making currently? EG: What data are you passing into your task? – rdegges Jan 04 '17 at 14:12
  • @rdegges thanks for the reply. The input is a large list of key-value pairs, ready to be inserted into MongoDB via `.insert_many()`. – WillJones Jan 04 '17 at 17:05

2 Answers


Since you mentioned in your comment that your task input is a large list of key/value pairs, I'm going to recommend the following (a rough code sketch follows the list):

  • Load up your list of key/value pairs in a file.
  • Upload the file to Amazon S3.
  • Get the resulting file URL, and pass that into your RQ task.
  • In your worker task, download the file.
  • Parse the file line-by-line, inserting the documents into Mongo.
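For illustration, one rough way to wire this up with boto3, pymongo and RQ might look like the sketch below. The bucket, key, database and collection names are placeholders, and it passes the S3 key rather than a full URL:

```
# Rough sketch -- bucket, key and connection details are placeholders.
import gzip
import json

import boto3
from pymongo import MongoClient
from redis import Redis
from rq import Queue

BUCKET = "my-task-payloads"  # placeholder bucket name


def enqueue_load(documents):
    """Write the documents to a gzipped JSON-lines file, push it to S3,
    and enqueue only the S3 key (a few bytes) instead of the data itself."""
    path = "/tmp/payload.jsonl.gz"
    with gzip.open(path, "wt") as f:
        for doc in documents:
            f.write(json.dumps(doc) + "\n")

    key = "task-payloads/payload.jsonl.gz"
    boto3.client("s3").upload_file(path, BUCKET, key)

    Queue(connection=Redis()).enqueue(load_documents, key)


def load_documents(key, batch_size=1000):
    """RQ worker task: stream the file from S3 and insert in batches."""
    path = "/tmp/payload.jsonl.gz"
    boto3.client("s3").download_file(BUCKET, key, path)

    collection = MongoClient()["mydb"]["mycollection"]  # placeholder names
    batch = []
    with gzip.open(path, "rt") as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                collection.insert_many(batch)
                batch = []
    if batch:
        collection.insert_many(batch)
```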

Using the method above, you'll be able to:

  • Quickly break up your tasks into manageable chunks.
  • Upload these small, compressed files to S3 quickly (use gzip).
  • Greatly reduce your redis usage by requiring much less data to be passed over the wires.
  • Configure S3 to automatically delete your files after a certain amount of time (there are S3 lifecycle settings for this: you can have files deleted automatically after 1 day, for instance; see the sketch after this list).
  • Greatly reduce memory consumption on your worker by processing the file one line at-a-time.
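For the automatic-deletion point, an S3 lifecycle rule can expire the uploaded payloads after a day. A minimal sketch with boto3, again assuming a placeholder bucket name and key prefix:

```
import boto3

# Expire anything under the task-payloads/ prefix one day after upload.
# Bucket name and prefix are placeholders.
boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-task-payloads",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-task-payloads",
                "Filter": {"Prefix": "task-payloads/"},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)
```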

For use cases like what you're doing, this will be MUCH faster and require much less overhead than sending these items through your queueing system.

Hope this helps!

rdegges

It turns out that the solution that worked for me was to save the data to Amazon S3 storage, and then pass the URI to the function in the background task.

WillJones