
I would like to implement a job queue in Mongo. The entire software system is based around Mongo so it seems natural and potentially a good fit.

The jobs collection stores each job's state as a document. I imagine this to be an uncapped collection based on my query needs. The job documents look like the following:

{
    "_id" : ObjectId("50a6742ee4b0a9a1c2cb4fd4"),
    "type" : "archive_job",
    "state" : 2,
    "priority" : 1,
    "timing" : {
        "submitted": ISODate(...),
        "running": ISODate(...),
        "completed": ISODate(...),
        "failed": null,
        "cancelled": null
    },
    "payload" : {
       ...job-specific JSON...
    }
}

The typical access patterns for the jobs collection will be as follows (see the index/query sketch after the list):

  • find unprocessed jobs to execute based on type, state, priority and possibly a range query on timing.submitted greater than the previous read time
  • find all processed (completed, failed, cancelled) jobs
  • find all unprocessed (submitted, running) jobs
  • find specific job by _id and retrieve its payload (when state is running)
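
A minimal sketch of the index and the main query I have in mind (the field order and the state code 0 = submitted are assumptions on my part):

// Compound index aimed at the "find unprocessed jobs to execute" pattern;
// the field order is my guess at what matches the query shape.
db.jobs.ensureIndex({ "type" : 1, "state" : 1, "priority" : -1, "timing.submitted" : 1 })

// Find unprocessed jobs of one type submitted since the previous read time:
var lastReadTime = ISODate("2012-11-16T00:00:00Z")
db.jobs.find({
    "type" : "archive_job",
    "state" : 0,
    "timing.submitted" : { "$gt" : lastReadTime }
}).sort({ "priority" : -1 })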

The bulk of the queries will be to find unprocessed jobs that need execution. Would it be worthwhile to move payload to a jobs_payload collection so the document size does not vary greatly in the jobs collection?

Will the large number of processed jobs (completed, failed, cancelled), relative to unprocessed jobs, eventually increase the working set memory required for the jobs collection? Will access times for finding unprocessed jobs to execute be slower even with the right indices?

What are my alternatives and trade-offs I can make with the schema design?

abargnesi
1 Answer


Would it be worthwhile to move payload to a jobs_payload collection so the document size does not vary greatly in the jobs collection?

Generally, embedding is the right approach in MongoDB; in your case it looks fine.
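
For the dequeue itself, findAndModify can claim the next job atomically so two workers never pick the same document. A rough sketch (the state codes 0 = submitted and 1 = running are assumptions about your numbering):

db.jobs.findAndModify({
    query : { "type" : "archive_job", "state" : 0 },
    sort : { "priority" : -1, "timing.submitted" : 1 },   // highest priority, oldest first
    update : { "$set" : { "state" : 1, "timing.running" : new Date() } },
    new : true   // return the claimed job, payload included
})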

Will the large number of processed jobs (completed, failed, cancelled), relative to unprocessed jobs, eventually increase the working set memory required for the jobs collection? Will access times for finding unprocessed jobs to execute be slower even with the right indices?

As long as the database fits in memory, the slowdown won't be noticeable.
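
You can keep an eye on this with the built-in stats, comparing data plus index size against the RAM on the box (a sketch; the exact field names vary between server versions):

db.jobs.stats()          // check "size", "storageSize" and "totalIndexSize"
db.serverStatus().mem    // resident/virtual memory of the mongod process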

Your schema looks OK. As an example, you can look at the schema used by Celery (which has a MongoDB backend).

g1zmo
  • I agree that embedding documents is emphasized in Mongo. In my case, won't embedding the *payload* field increase the size of each document in the **jobs** collection, thereby slowing reads? – abargnesi Nov 18 '12 at 00:46
  • Embedding in your case will not significantly slow reads (you can read just a subset of the needed fields), but it will certainly decrease the number of queries. – g1zmo Nov 18 '12 at 08:25
  • Embedding will increase the document size, but I am guessing the variation is not really large, is it? (not 1 MB to 5 MB). The working set would not increase if you do not query the processed data often; otherwise it is part of the working set. Tip: you could leverage the _id ObjectId as the submitted time if you want to save some data. – Marc Nov 19 '12 at 22:49
  • The variation would not be that large. I anticipate a payload to vary between 300 bytes and 3 kilobytes; job payloads will mostly contain RESTful URIs to other data retrieved when a job begins to run. Regarding the working set, are you saying it will not increase if the processed documents are not queried? Are there ways to guarantee this behavior through indices or field selections? Or is it worth separating processed documents into another collection? Also, can you do range queries against the creation date of an ObjectId? – abargnesi Nov 20 '12 at 11:24
  • If you want to do range queries on that date, I would stick to ISODate. The working set is the set of data you want to have in RAM / that is actually queried, like active users; data which is not queried is not part of the working set. On the other hand, an infinitely growing collection might not be the perfect solution, so a batch process (which will cause page faulting) to move old data somewhere else might be the next step. – Marc Dec 04 '12 at 15:00
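
To illustrate the last two comments, here is a sketch of both ideas; the objectIdFromDate helper, the jobs_archive collection name, and the state codes 2/3/4 for completed/failed/cancelled are all hypothetical:

// Range query on the creation time embedded in _id: build an ObjectId from a
// Date by taking the timestamp in seconds as hex and zero-padding the rest.
function objectIdFromDate(d) {
    return ObjectId(Math.floor(d.getTime() / 1000).toString(16) + "0000000000000000")
}
db.jobs.find({ "_id" : { "$gte" : objectIdFromDate(new Date("2012-11-01")) } })

// Batch process that moves old processed jobs into a separate collection,
// keeping the hot jobs collection (and its working set) small:
var cutoff = objectIdFromDate(new Date(Date.now() - 30 * 24 * 60 * 60 * 1000))
db.jobs.find({ "state" : { "$in" : [2, 3, 4] }, "_id" : { "$lt" : cutoff } }).forEach(function(doc) {
    db.jobs_archive.insert(doc)
    db.jobs.remove({ "_id" : doc._id })
})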